
Pattern: RAG (Retrieval-Augmented Generation)

Quick facts

  • Category: AI / LLM-Integrated Systems
  • Maturity: Adopt
  • Typical team size: 2-4 engineers
  • Typical timeline to MVP: 3-6 weeks
  • Last reviewed: 2026-05-02 by Architecture Team

1. Context

Use this pattern when:

  • You need an LLM to answer questions over a private or frequently updated corpus that was not in its training data
  • Answers must be traceable to a source — citations reduce hallucination risk and build user trust
  • The knowledge base changes often enough that baking it into model weights via fine-tuning would produce stale results within months
  • The corpus spans thousands to millions of documents: internal wikis, support tickets, contracts, research papers, product documentation

Do NOT use this pattern when:

  • The entire knowledge base fits comfortably in the model's context window (< 100k tokens, ~200 pages) — just pass it all in; retrieval adds complexity with no benefit
  • The task requires reasoning over structured data (tables, metrics) — use a Text-to-SQL pattern or direct database queries instead
  • Real-time freshness under 30 seconds is required — vector index update latency is typically minutes
  • You have not yet tested whether basic prompting with a large context window already solves the problem — it often does

2. Problem it solves

Organisations have vast stores of proprietary knowledge — internal documentation, past decisions, product manuals, support histories — that an LLM was never trained on and cannot access. Feeding this knowledge into the model through fine-tuning is slow and expensive, produces stale results as soon as the corpus updates, and does not give the model the ability to cite its sources. RAG solves this by retrieving only the most relevant passages at query time and including them in the prompt, keeping the LLM grounded in current, auditable content.

3. Solution overview

System context (C4 Level 1)

flowchart LR
    User((User)) --> App[RAG Application]
    Editor((Content Editor)) -->|publishes docs| DocStore[(Document Store\nS3 / Confluence / Notion)]
    DocStore -->|triggers ingestion| Ingest[Ingestion Pipeline]
    Ingest --> VecDB[(Vector Store\nPostgres + pgvector)]
    App --> VecDB
    App --> LLM[LLM Provider\nAnthropic / OpenAI]
    App --> Obs[Observability\nLangfuse]

Container view (C4 Level 2)

flowchart TB
    subgraph Ingestion Pipeline
        Loader[Document Loader\nS3 / Confluence / PDF parser]
        Chunker[Chunker\nrecursive 512-token chunks]
        Embedder[Embedder\ntext-embedding-3-small]
        VecWrite[(pgvector write)]
    end
    subgraph Query Pipeline
        QueryAPI[Query API\nFastAPI]
        QEmbed[Query Embedder\nsame model as ingestion]
        Retriever[Retriever\ntop-k cosine search]
        Reranker[Reranker\nCohere Rerank — optional]
        CtxBuilder[Context Assembler\nprompt template + passages]
        LLMCall[LLM Client\nstreaming]
    end
    subgraph Storage
        VecDB[(PostgreSQL + pgvector\nembeddings + metadata)]
        DocMeta[(Document metadata\npath, version, acl)]
    end
    subgraph Ops
        Obs[Langfuse\ntrace every query]
        Eval[Eval Harness\npytest + RAGAS]
    end

    Loader --> Chunker --> Embedder --> VecWrite --> VecDB
    QueryAPI --> QEmbed --> Retriever
    Retriever --> VecDB
    Retriever --> Reranker --> CtxBuilder
    CtxBuilder --> LLMCall
    LLMCall --> QueryAPI
    QueryAPI --> Obs
    VecDB --> DocMeta
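
The query pipeline above maps onto very little code. Below is a minimal sketch in Python, assuming a chunks table with content text and embedding vector(1536) columns (1536 dimensions matches text-embedding-3-small), the openai, anthropic, and psycopg packages, and API keys in the environment. Function names and the model alias are illustrative, not a prescribed interface; the optional reranker and Langfuse tracing are omitted for brevity.

import anthropic
import psycopg
from openai import OpenAI

openai_client = OpenAI()                  # reads OPENAI_API_KEY
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

# Grounding instruction: see "LLM answers outside the retrieved context" in section 9.
SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "If the answer is not present in the context, say so."
)

def embed(text: str) -> list[float]:
    # Queries must use the same embedding model as ingestion.
    resp = openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return resp.data[0].embedding

def retrieve(conn: psycopg.Connection, query: str, k: int = 5) -> list[str]:
    # pgvector's <=> operator is cosine distance: smaller means closer.
    vec = "[" + ",".join(map(str, embed(query))) + "]"
    rows = conn.execute(
        "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (vec, k),
    ).fetchall()
    return [content for (content,) in rows]

def answer(conn: psycopg.Connection, question: str) -> str:
    context = "\n\n---\n\n".join(retrieve(conn, question))
    msg = anthropic_client.messages.create(
        model="claude-3-5-sonnet-latest",  # alias is illustrative; pin a version in production
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return msg.content[0].text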

4. Technology stack

Layer | Primary choice | Alternatives | Notes
Language | Python 3.12+ | TypeScript (Node.js) | Python for the ingestion pipeline and API; TypeScript if the frontend team owns the whole stack
LLM | Anthropic Claude 3.5 Sonnet | OpenAI GPT-4o, Google Gemini 1.5 Pro | See ADR-0006; Sonnet balances quality and cost for RAG generation
Embedding model | OpenAI text-embedding-3-small | Cohere embed-v3, Voyage voyage-3 | text-embedding-3-small is cheap, fast, and accurate for English; Voyage voyage-3 for highest retrieval accuracy
Vector store | PostgreSQL + pgvector | Pinecone, Qdrant, Weaviate | See ADR-0004; pgvector co-locates with your existing Postgres
Orchestration | Hand-rolled Python | LlamaIndex, LangChain | See ADR-0005; LlamaIndex is acceptable for RAG-specific pipelines
Chunking | Recursive character splitting (512 tok / 50 overlap) | Semantic chunking, document-structure-aware | The boring default works well; tune chunk size empirically on your eval set, not upfront
Reranker | Cohere Rerank API | Cross-encoder (local), none | Reranking significantly improves precision at the cost of ~100 ms + API fee; add it after the baseline pipeline works
Observability | Langfuse | LangSmith, Helicone, Arize Phoenix | Langfuse is open-source, self-hostable, and tracks traces + eval scores in one place
Evaluation | Custom pytest + RAGAS | Braintrust, DeepEval | Always build an eval harness before tuning retrieval parameters — you need a signal
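
To make the Chunking and Orchestration rows concrete, here is a hand-rolled ingestion sketch. It uses a plain sliding token window rather than the recursive splitter named above (simpler, but the same 512/50 geometry), and it deletes a document's old chunks before re-inserting them, which is the stale-embedding mitigation from section 9. Table and column names (chunks, doc_id) are illustrative assumptions; requires tiktoken, openai, and psycopg.

import psycopg
import tiktoken
from openai import OpenAI

enc = tiktoken.get_encoding("cl100k_base")
client = OpenAI()

def chunk(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # Plain sliding token window; a production chunker should also respect
    # sentence and heading boundaries (recursive or structure-aware splitting).
    tokens = enc.encode(text)
    if not tokens:
        return []
    step = size - overlap
    return [enc.decode(tokens[i : i + size]) for i in range(0, len(tokens), step)]

def ingest(conn: psycopg.Connection, doc_id: str, text: str) -> None:
    pieces = chunk(text)
    if not pieces:
        return
    # One embeddings call for all chunks of the document.
    resp = client.embeddings.create(model="text-embedding-3-small", input=pieces)
    with conn.transaction():
        # Delete-then-insert keyed on doc_id, so an updated document
        # can never leave stale chunks behind in the index.
        conn.execute("DELETE FROM chunks WHERE doc_id = %s", (doc_id,))
        for piece, item in zip(pieces, resp.data):
            vec = "[" + ",".join(map(str, item.embedding)) + "]"
            conn.execute(
                "INSERT INTO chunks (doc_id, content, embedding) "
                "VALUES (%s, %s, %s::vector)",
                (doc_id, piece, vec),
            )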

5. Non-functional characteristics

Concern | Profile
Scalability | pgvector handles ~1M document chunks with sub-100 ms retrieval on an ivfflat index; migrate to a dedicated vector DB (Qdrant, Pinecone) above ~10M chunks or if multi-tenancy requires namespace isolation. The ingestion pipeline scales horizontally by partitioning documents across workers.
Availability target | 99.9% — same as the underlying Postgres. LLM API availability (~99.5% for Anthropic/OpenAI) is typically the binding constraint; implement a fallback error message rather than crashing.
Latency target | p95 < 2 s for a complete RAG response: embedding ~50 ms, retrieval ~100 ms, reranking ~150 ms, LLM generation ~1–1.5 s (streaming hides this). Time-to-first-token < 600 ms with streaming enabled.
Security posture | Enforce document-level access control before retrieval — never return chunks the querying user is not permitted to read. Store ACL metadata alongside each chunk in the vector store; filter on it at query time. Treat LLM API calls as data-exfiltration paths: sanitise chunks before including them in prompts.
Data residency | Document chunks and their embeddings live in your Postgres instance. Every query sends the retrieved passages to the LLM API — ensure this is permissible under your data classification policy before deploying to production.
Compliance fit | GDPR ✓ with EU region deployment; right-to-erasure requires deleting both source documents and their embeddings (write a clean-up job). HIPAA ✓ with a BAA from the LLM provider (Anthropic and OpenAI both offer BAAs). SOC 2 ✓ with an audit log of every query + source chunks returned.
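
The ivfflat index behind the scalability numbers is a one-line migration; a sketch follows. Build it after the table is populated (ivfflat clusters existing rows), and treat the lists value as a starting point: the pgvector README suggests roughly rows/1000 for corpora up to about a million rows. The connection string is a placeholder.

import psycopg

DDL = """
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 1000)  -- pgvector heuristic: about rows/1000, up to ~1M rows
"""

with psycopg.connect("dbname=rag") as conn:  # placeholder DSN
    conn.execute(DDL)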

6. Cost ballpark

Indicative monthly USD cost. LLM token spend is the dominant variable; retrieval infrastructure is cheap.

Scale | Documents in corpus | Monthly cost | Cost drivers
Small | < 10,000 | $50 - $250 | One-time embedding ingestion cost + ongoing query token spend + Postgres hosting
Medium | 10k - 500k | $500 - $3,000 | Larger Postgres instance, higher query volume, optional Cohere Rerank API
Large | 500k+ | $3,000 - $15,000 | Dedicated vector DB if migrated from pgvector, high LLM token volume, Langfuse observability at scale
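
A back-of-envelope script makes the "token spend dominates" claim checkable. Every number below (corpus size, traffic, and especially unit prices) is an assumption for illustration; substitute current rate-card prices before using it for budgeting.

# cost_estimate.py — illustrative monthly cost model for the medium tier.
docs = 100_000
tokens_per_doc = 2_000
queries_per_month = 50_000
ctx_tokens_per_query = 4_000      # retrieved passages + prompt template
out_tokens_per_query = 400

embed_price = 0.02 / 1e6          # $/token, text-embedding-3-small (assumed)
in_price = 3.00 / 1e6             # $/input token, Claude 3.5 Sonnet (assumed)
out_price = 15.00 / 1e6           # $/output token (assumed)

one_time_embedding = docs * tokens_per_doc * embed_price
monthly_llm = queries_per_month * (
    ctx_tokens_per_query * in_price + out_tokens_per_query * out_price
)
print(f"one-time embedding: ${one_time_embedding:,.0f}")  # $4
print(f"monthly generation: ${monthly_llm:,.0f}")         # $900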

7. LLM-assisted development fit

Aspect | Rating | Notes
Ingestion pipeline (load, chunk, embed, upsert) | ★★★★★ | Excellent — patterns are extremely well-represented; generate the skeleton and iterate.
Retrieval and prompt assembly | ★★★★ | Good; verify top-k values, prompt template wording, and chunk overlap on your own eval set.
Retrieval quality tuning (threshold, reranking, hybrid search) | ★★★ | Knows the levers, but optimal values require empirical evaluation on your specific corpus. Do not trust defaults.
Evaluation harness and test set generation | ★★★ | Can generate plausible question-answer pairs for eval; domain experts must validate ground-truth answers.
Architecture decisions | n/a | Don't outsource. Use ADRs.

Recommended workflow: Build an eval set of 50–100 question-answer pairs from your corpus before writing any retrieval code. Use it to measure every change. Generate ingestion code with the LLM; hand-tune the chunk size and retrieval parameters against eval scores.
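
A minimal shape for that harness, as a pytest test over recall@k. The eval_set.json layout and the retrieve_ids import are assumptions: each item records the question plus the ID of the chunk containing the validated answer, and retrieve_ids is your retriever returning top-k chunk IDs. RAGAS layers answer-quality metrics on top; raw recall@k is the first signal worth automating.

# test_retrieval.py
import json

from rag.retriever import retrieve_ids  # hypothetical: returns top-k chunk IDs

K = 5

def recall_at_k(eval_items, retrieve, k=K):
    """Fraction of questions whose validated chunk appears in the top-k results."""
    hits = sum(
        1
        for item in eval_items
        if item["gold_chunk_id"] in retrieve(item["question"], k)
    )
    return hits / len(eval_items)

def test_recall_at_5():
    with open("eval_set.json") as f:
        eval_items = json.load(f)  # 50-100 domain-expert-validated pairs
    score = recall_at_k(eval_items, retrieve_ids)
    assert score >= 0.85, f"recall@{K} regressed to {score:.2f}"  # threshold is illustrative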

8. Reference implementations

  • Public reference: langchain-ai/rag-from-scratch — companion repo to LangChain's RAG from Scratch YouTube series; 18 notebooks covering naive RAG through advanced techniques
  • Public reference: run-llama/llama_index — LlamaIndex framework; docs/examples/ contains end-to-end RAG pipeline examples for dozens of vector stores
  • Public reference: pgvector/pgvector — the pgvector extension itself; README covers index types, distance functions, and performance tuning
  • Internal case study: Add your anonymised internal example here

9. Known risks & gotchas

  • Chunk boundaries split semantic units — A sentence split across two chunks loses meaning; neither chunk retrieves well. Mitigation: use recursive character splitting with a meaningful overlap (50 tokens); for structured documents (contracts, specs), write a document-structure-aware chunker that respects heading and paragraph boundaries.
  • Access control leakage through retrieval — The retrieval step returns chunks without checking whether the querying user is authorised to read the source document. Mitigation: store the source document's ACL as metadata on every chunk and apply a WHERE acl_allows($user_id) filter in the similarity search — not as a post-retrieval step (see the sketch after this list).
  • Stale embeddings after document updates — If a document is updated but its old chunks are not removed, the index contains contradictory information. Mitigation: implement a delete-and-re-index on every document update keyed on a stable document ID; track the last-embedded version hash.
  • Retrieval precision vs. recall mismatch — A high top-k returns noisy context that confuses the model; a low top-k misses the relevant passage. Mitigation: measure recall@k on your eval set; add a reranker to improve precision after retrieval rather than tuning top-k alone.
  • LLM answers outside the retrieved context — The model supplements retrieved chunks with parametric knowledge, producing confident hallucinations. Mitigation: include an explicit instruction in the system prompt ("Answer only from the provided context; if the answer is not present, say so") and verify this instruction is honoured during eval.
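
A sketch of the in-query ACL filter from the access-control gotcha above, assuming each chunk row carries an acl_groups text[] column naming the groups allowed to read the source document (&& is the Postgres array-overlap operator, and psycopg adapts a Python list of strings to a Postgres array). The point is that the filter runs inside the similarity search, so unauthorised chunks can never appear in the candidate set.

import psycopg

def retrieve_for_user(
    conn: psycopg.Connection,
    query_vec: list[float],
    user_groups: list[str],
    k: int = 5,
) -> list[str]:
    vec = "[" + ",".join(map(str, query_vec)) + "]"
    rows = conn.execute(
        """
        SELECT content
        FROM chunks
        WHERE acl_groups && %s          -- array overlap: user holds an allowed group
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (user_groups, vec, k),
    ).fetchall()
    return [content for (content,) in rows]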