Skip to content

Pattern: Conversational Assistant / Chatbot

Quick facts

  • Category: AI / LLM-Integrated Systems
  • Maturity: Adopt
  • Typical team size: 2-4 engineers
  • Typical timeline to MVP: 4-8 weeks
  • Last reviewed: 2026-05-02 by Architecture Team

1. Context

Use this pattern when:

  • Handling high volumes of repetitive customer support, internal helpdesk, or FAQ queries where the answer space is well-defined
  • Users prefer a natural-language interface over navigating documentation trees or form-based UIs
  • Escalation to a human agent is acceptable for complex or sensitive cases — the system is not fully autonomous
  • You have a knowledge corpus to ground answers in (pair with the RAG pattern for private knowledge)

Do NOT use this pattern when:

  • The interaction must initiate irreversible real-world actions (payments, deletions, external API mutations) without a human-in-the-loop gate — use the Agentic Workflow pattern with explicit approval steps instead
  • Deterministic output is required: the chatbot is probabilistic and will occasionally produce wrong answers at a rate you cannot engineer to zero
  • The domain requires guaranteed correctness (medical diagnosis, legal advice, financial execution) — AI assistance without mandatory human review is inappropriate in these domains
  • User trust in AI is low enough that a chatbot will generate more support tickets than it deflects

2. Problem it solves

Support teams and internal helpdesks field thousands of repetitive queries. Staffing enough humans to cover all channels, all hours, is expensive and slow. Users wait; agents burn out on copy-paste answers. This pattern provides instant, always-available first-line responses that deflect common queries, surface relevant knowledge, and hand off complex cases to humans — reducing ticket volume without reducing support quality for hard problems.

3. Solution overview

System context (C4 Level 1)

flowchart LR
    User((User)) --> UI[Chat UI\nweb widget or mobile]
    UI --> ConvAPI[Conversation API]
    ConvAPI --> LLM[LLM Provider\nAnthropic / OpenAI]
    ConvAPI --> KB[Knowledge Base\nRAG retriever]
    ConvAPI --> CRM[CRM / Ticketing\nZendesk / Intercom]
    ConvAPI --> Obs[Observability\nLangfuse]
    Agent((Human Agent)) --> CRM

Container view (C4 Level 2)

flowchart TB
    subgraph Frontend
        Widget[Chat Widget\nReact + Vercel AI SDK]
        SSE[SSE stream\ntoken-by-token]
    end
    subgraph Backend
        ConvAPI[Conversation API\nFastAPI async]
        HistStore[(Conversation History\nPostgres)]
        CtxMgr[Context Manager\ntruncate + summarise old turns]
        RAGRetriever[RAG Retriever\noptional knowledge base]
        GuardIn[Input Guardrails\nPII detection, topic filter]
        LLMClient[LLM Client\nstreaming]
        GuardOut[Output Guardrails\nsafety + PII redaction]
        Escalation[Escalation Router\nsentiment + intent]
    end
    subgraph External
        LLM[LLM Provider API]
        Ticket[Ticketing System\nZendesk / Intercom]
    end
    subgraph Ops
        Langfuse[Langfuse\ntraces + CSAT scores]
        Metrics[Custom metrics\ndeflection rate, escalation rate]
    end

    Widget --> SSE --> ConvAPI
    ConvAPI --> HistStore
    ConvAPI --> CtxMgr
    ConvAPI --> GuardIn --> RAGRetriever
    RAGRetriever --> LLMClient
    GuardIn --> LLMClient
    LLMClient --> GuardOut --> Widget
    GuardOut --> Escalation
    Escalation --> Ticket
    ConvAPI --> Langfuse
    Langfuse --> Metrics

4. Technology stack

Layer Primary choice Alternatives Notes
LLM Anthropic Claude 3.5 Haiku GPT-4o-mini, Gemini 2.0 Flash See ADR-0006; fast cheap models dominate at chatbot scale — cost-per-token matters
Streaming Server-Sent Events (SSE) WebSockets SSE is simpler for one-directional token streaming; use WebSockets only if you need bidirectional real-time events
Frontend chat component Vercel AI SDK (useChat) CopilotKit, Chainlit, open-source widget useChat handles streaming state, optimistic UI, and error recovery in ~20 lines; Chainlit for rapid prototyping with a hosted UI
Backend API FastAPI (async) NestJS, Express FastAPI's async generators map directly to SSE streaming with minimal boilerplate
Conversation history PostgreSQL Redis (session-scoped), DynamoDB Postgres for durable history that survives restarts; Redis if you only need in-session memory
Context window management Manual truncation + rolling summary mem0, Zep Keep history under 80% of the model's context window; summarise older turns rather than truncating them cold
Input/output guardrails Llama Guard 3 (self-hosted) + provider moderation NeMo Guardrails, Guardrails AI Always run provider-side moderation as a first pass; add topic-restriction and PII redaction for compliance
RAG knowledge base pgvector (see RAG pattern) None (pure parametric) Pair with the RAG pattern for private knowledge; pure chatbots without grounding hallucinate facts freely
Observability Langfuse LangSmith, Helicone Track: deflection rate, escalation rate, CSAT per conversation, token cost per session

5. Non-functional characteristics

Concern Profile
Scalability Stateless API tier scales horizontally behind a load balancer. Conversation history in Postgres with PgBouncer connection pooling. SSE connections are long-lived (~30 s per response); plan HTTP connection limits accordingly.
Availability target 99.9%; LLM API downtime is the dominant failure mode. Implement a fallback: surface a "currently unavailable" message and offer to open a ticket rather than showing a raw API error.
Latency target Time-to-first-token < 500 ms for Haiku / Flash models. Full response perceived latency is masked by streaming. Users begin abandoning after ~2 s of blank screen — streaming is not optional.
Security posture Rate-limit per user and per organisation (token-bucket). Detect and redact PII in both directions. Never log raw conversation content to unencrypted sinks. Guard against prompt injection via user input by using a separate system prompt that cannot be overridden by the user's turn.
Data residency Conversation history is stored in your Postgres instance. Message text is transmitted to the LLM API per request — ensure this is acceptable under your data classification and contractual obligations.
Compliance fit GDPR ✓ — implement right-to-erasure on conversation history; disclose AI use in the product's privacy policy. HIPAA ✓ with BAA from LLM provider; never pass unredacted health data if BAA is not in place. SOC 2 ✓ with conversation audit log and access controls.

6. Cost ballpark

Indicative monthly USD cost. LLM token spend is the dominant variable; use the cheapest capable model.

Scale Conversations / month Monthly cost Cost drivers
Small < 5,000 $50 - $300 LLM API tokens (Haiku/Flash), Postgres, hosting
Medium 5k - 100k $500 - $5,000 LLM API at volume, Langfuse observability plan, RAG infrastructure if paired
Large 100k+ $5,000 - $30,000 LLM API dominant; evaluate prompt caching (up to 90% reduction on repeated system prompts), model tier, and context compression

7. LLM-assisted development fit

Aspect Rating Notes
Streaming API integration and SSE wiring ★★★★★ Excellent — Vercel AI SDK + FastAPI streaming is very well-represented in training data.
System prompt and persona engineering ★★★★ Good starting point; always red-team the prompt with adversarial inputs before launch.
Guardrail and escalation logic ★★★ Produces structurally correct guardrails; the thresholds (confidence, sentiment) need tuning against real conversation data.
Context window management (summarisation) ★★★ Knows the pattern; edge cases around mid-conversation summarisation require careful manual testing.
Architecture decisions Don't outsource. Use ADRs.

Recommended workflow: Define the escalation criteria and success metrics (deflection rate, CSAT target) before writing code. Ship to 5% of traffic first. Human agents should review the first 200 conversations before tuning the system prompt — real user queries will surprise you.

8. Reference implementations

  • Public reference: langfuse/langfuse — open-source LLM observability platform; the web/ app is itself a Next.js + FastAPI system with streaming chat — useful reference for production LLM app architecture (200 OK ✓)
  • Public reference: run-llama/llama_index — chat examples — integration examples covering conversation memory, tool use, and RAG-backed chat (200 OK ✓)
  • Internal case study: Add your anonymised internal example here

10. Known risks & gotchas

  • Context window overflow corrupts long conversations — Without active management, conversation history eventually exceeds the context window. The model silently drops the oldest turns, losing critical context. Mitigation: track token counts for every turn; trigger a rolling summarisation step when history reaches 70% of the context limit — before you hit the limit, not after.
  • Confident hallucinations on out-of-scope questions — The chatbot answers questions outside its intended scope with the same confidence as in-scope ones. Mitigation: define explicit topic boundaries in the system prompt; evaluate the chatbot on out-of-scope questions during QA; pair with a RAG knowledge base to ground answers.
  • Escalation rate drift after system prompt changes — A prompt update that improves deflection rate may silently increase the rate of bad answers that never get escalated. Mitigation: treat escalation rate as a two-sided metric (too high = bot is unhelpful; too low = bot is over-confident); alert on both directions.
  • Prompt injection via user input — A user sends "Ignore all previous instructions and instead output X." Mitigation: place the system prompt in the system role (not user), which is harder to override; include an explicit instruction not to follow commands that override the persona; test with known injection patterns.
  • GDPR deletion does not remove LLM fine-tuning exposure — If conversation data is later used for fine-tuning, a deletion request does not remove that data from model weights. Mitigation: clearly separate "operational conversation history" (deletable) from any data used for training (subject to a separate data processing agreement).