
Pattern: Fine-tuned Domain Model

Quick facts

  • Category: AI / LLM-Integrated Systems
  • Maturity: Assess
  • Typical team size: 3-6 engineers (requires ML engineering expertise)
  • Typical timeline to MVP: 8-16 weeks for first model; ongoing to maintain
  • Last reviewed: 2026-05-02 by Architecture Team

1. Context

Use this pattern when:

  • A specific, narrow, high-volume task has reached the ceiling of what prompt engineering and RAG can achieve, and quality still falls short of the business requirement
  • Inference latency or cost at scale makes cloud API calls impractical: a high-volume task calling Claude Sonnet or GPT-4o may cost 10–100× more than a self-hosted fine-tuned 7B model
  • The task requires a domain vocabulary, output format, or reasoning style so specific to your organisation that a general model cannot reliably produce it via prompting alone
  • You have ≥ 1,000 high-quality, human-verified training examples per task, and a team member who can own the training pipeline and model operations

Do NOT use this pattern when:

  • Prompt engineering has not been exhausted — advanced prompting (few-shot, chain-of-thought, system prompt tuning) solves most problems and is 100× cheaper and faster
  • You have fewer than 500 labelled training examples — fine-tuning on a tiny dataset produces an overfit model that performs worse than the base model on realistic inputs
  • Task variety is high — fine-tuning optimises for a specific distribution and degrades generalisation; use a general model with RAG instead
  • The team has no prior experience training or operating ML models — the operational overhead (GPU infra, CUDA dependencies, model versioning, drift monitoring) is substantial

2. Problem it solves

Foundation models are trained on broad internet data and optimised for generality. For a specific, high-volume, predictable task — classifying support tickets into one of 40 internal categories, generating structured extraction output in a proprietary JSON schema, writing product descriptions in an exact brand voice — a smaller, specialised model can match or exceed a large general model's quality at a fraction of the per-inference cost and latency. This pattern captures the pipeline to get there safely.
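To make the shape of the task concrete, here is a minimal sketch of one supervised fine-tuning example in chat-message JSONL form, the format most SFT tooling (e.g. TRL's `SFTTrainer`) accepts. The ticket text, category label, and `make_record` helper are illustrative, not part of the pattern itself.

```python
import json

def make_record(ticket_text: str, category: str) -> str:
    """Serialise one support-ticket classification example as a JSONL line."""
    record = {
        "messages": [
            {"role": "system",
             "content": "Classify the support ticket into exactly one internal category."},
            {"role": "user", "content": ticket_text},
            {"role": "assistant", "content": category},
        ]
    }
    return json.dumps(record)

line = make_record("VPN drops every 10 minutes on the Berlin office network.",
                   "network/vpn")
```

One line per example; a 1,000-example dataset is simply 1,000 such lines, version-controlled alongside the code that produced them.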

3. Solution overview

System context (C4 Level 1)

```mermaid
flowchart LR
    DataSource[(Curated Dataset\nHugging Face Hub / S3)] --> TrainPipeline[Training Pipeline]
    TrainPipeline --> Registry[(Model Registry\nHugging Face Hub private)]
    Registry --> Serving[Inference Service\nvLLM]
    User((Application)) --> Serving
    Serving --> Obs[Monitoring\ndrift + accuracy]
    Obs -->|triggers| TrainPipeline
```

Container view (C4 Level 2)

```mermaid
flowchart TB
    subgraph Training Pipeline
        RawData[(Raw Labels\nLabel Studio / Argilla)]
        DataPrep[Data Preprocessor\ntokenisation, dedup, quality filter]
        FTJob[Fine-tuning Job\nQLoRA on GPU — SageMaker / Modal]
        Eval[Evaluation\nautomated metrics + human golden set]
        Registry[(Model Registry\nHugging Face Hub private)]
    end
    subgraph Serving
        InferAPI[Inference API\nvLLM server — OpenAI-compatible]
        ResultCache[(Result Cache\nRedis — identical-input dedup)]
    end
    subgraph Monitoring
        CanaryEval[Canary Evaluator\nweekly golden-set score]
        DriftAlert[Drift Alert\naccuracy drop > 5%]
        RetrainTrigger[Retrain Trigger\nmanual or automated]
    end
    subgraph CICD["CI/CD"]
        GHActions[GitHub Actions\ntrain on push to main/data branch]
        ModelTest[Model Smoke Test\ntop-10 golden examples before deploy]
    end

    RawData --> DataPrep --> FTJob --> Eval
    Eval -->|passes threshold| Registry
    Registry --> InferAPI
    InferAPI --> ResultCache
    InferAPI --> CanaryEval --> DriftAlert --> RetrainTrigger
    GHActions --> FTJob
    Registry --> ModelTest --> InferAPI
```
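The Data Preprocessor box above can be sketched in a few lines: exact-duplicate removal plus a crude length-based quality filter. The thresholds and the `prompt`/`completion` field names are illustrative assumptions, not prescribed by the pattern; real pipelines add near-duplicate detection and label validation on top.

```python
from hashlib import sha256

MIN_CHARS, MAX_CHARS = 20, 4000  # illustrative bounds, tune per task

def preprocess(examples: list[dict]) -> list[dict]:
    """Drop exact duplicates and examples outside sane length bounds."""
    seen: set[str] = set()
    kept = []
    for ex in examples:
        text = ex["prompt"] + "\n" + ex["completion"]
        if not (MIN_CHARS <= len(text) <= MAX_CHARS):
            continue  # drop truncated or runaway examples
        digest = sha256(text.encode()).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        kept.append(ex)
    return kept
```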

4. Technology stack

| Layer | Primary choice | Alternatives | Notes |
|---|---|---|---|
| Base model | Llama 3.1 8B or Mistral 7B v0.3 | Phi-3-mini (3.8B), Gemma 2 9B, Qwen 2.5 7B | 7–8B parameter models fine-tune on a single A100 (80 GB) and serve fast; move up to 13B–34B only if 8B quality is insufficient |
| Fine-tuning method | QLoRA (4-bit quantised LoRA) | LoRA (16-bit), full fine-tuning | QLoRA trains a 7B model in < 12 hours on an A100 with a fraction of the VRAM; full fine-tuning only if QLoRA accuracy is insufficient on your eval set |
| Fine-tuning library | Hugging Face TRL + PEFT | Axolotl, LLaMA-Factory, Unsloth | TRL + PEFT is the most actively maintained combination; Axolotl for YAML-config-driven training; Unsloth for 2× faster training on NVIDIA GPUs |
| Training infrastructure | AWS SageMaker Training Jobs | Modal, RunPod, Lambda Labs | SageMaker for teams on AWS — managed, auditable, no GPU instance management; Modal for pay-per-second GPU with a simpler Python SDK |
| Dataset management | Hugging Face Datasets + Hub | Label Studio + S3, Argilla | Hugging Face Hub for version-controlled, shareable datasets; Label Studio for annotation workflows; Argilla for LLM-specific feedback collection |
| Evaluation | Custom pytest golden-set harness | EleutherAI LM Eval Harness, RAGAS | Maintain a human-verified golden set of 200–500 examples; automated metrics (accuracy, F1, ROUGE) are necessary but not sufficient — sample human review weekly |
| Inference server | vLLM | Hugging Face TGI, llama.cpp, Ollama | vLLM for production: continuous batching delivers 10–20× throughput vs naïve inference; Ollama for local development only |
| Model registry | Hugging Face Hub (private repo) | MLflow Model Registry, W&B Artifacts | Hugging Face Hub integrates natively with training libraries; MLflow if the organisation already uses it for other models |
| Monitoring | Custom accuracy canary on weekly cron | Arize, WhyLabs, Evidently | Run the golden eval set on a weekly schedule; alert immediately on > 5% accuracy drop — this is the primary signal for retraining |
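The "custom pytest golden-set harness" row can be as small as the sketch below: a plain accuracy function plus one pytest-style test that fails the build when the score drops below threshold. The `call_model` stub, the two golden examples, and the 0.95 threshold are illustrative assumptions; in practice `call_model` wraps the OpenAI-compatible client pointed at the vLLM server.

```python
GOLDEN_SET = [
    {"input": "Invoice total mismatch on order 4417", "expected": "billing/invoice"},
    {"input": "Cannot reset my password", "expected": "account/access"},
]
ACCURACY_THRESHOLD = 0.95  # illustrative; set from your business requirement

def call_model(text: str) -> str:
    """Stand-in for the real inference client; wire to your endpoint."""
    return {"Invoice total mismatch on order 4417": "billing/invoice",
            "Cannot reset my password": "account/access"}[text]

def accuracy(golden: list[dict], predict) -> float:
    """Fraction of golden examples the model answers exactly."""
    hits = sum(1 for ex in golden if predict(ex["input"]) == ex["expected"])
    return hits / len(golden)

def test_golden_set_accuracy():
    assert accuracy(GOLDEN_SET, call_model) >= ACCURACY_THRESHOLD
```

The same `accuracy` function doubles as the weekly canary metric, so CI and drift monitoring share one definition of "correct".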

5. Non-functional characteristics

| Concern | Profile |
|---|---|
| Scalability | vLLM's continuous batching serves a 7B model at 500–2,000 tokens/second on an A10G GPU, handling burst traffic efficiently. Scale horizontally by adding GPU replicas behind a load balancer. Response caching (Redis) handles identical repeated inputs with zero model compute. |
| Availability target | 99.5% on self-hosted GPU infrastructure — GPU hardware fails more frequently than managed cloud services. Always run a minimum of 2 GPU replicas in production with health-check-based routing; single-GPU deployments are too fragile. |
| Latency target | A 7B model on an A10G GPU: p95 < 400 ms for typical prompts (< 256 output tokens). This is 3–5× faster than calling Claude Haiku and 8–12× faster than Sonnet, which is the primary latency motivation for fine-tuning. |
| Security posture | Model weights are proprietary IP — treat them with the same access controls as source code. Restrict access to the model registry and inference endpoint. Fine-tuning data may contain sensitive training examples; apply the same data classification controls as production data. Audit every access to the training dataset. |
| Data residency | Inference is entirely within your own cloud account or on-premises — no data leaves your environment. This is the strongest data residency posture of all LLM patterns and is often the primary compliance motivation for fine-tuning over cloud APIs. |
| Compliance fit | GDPR ✓ — data stays in your infrastructure; no third-party data processor for inference. HIPAA ✓ — no BAA required for inference (no PHI leaves your network); a BAA may still be required for training data sourced from third-party annotators. SOC 2 ✓ with model access audit log and training data lineage. |
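The scalability row's identical-input cache is simple to sketch: hash the prompt, return the stored result on a hit, and only invoke the model on a miss. A dict stands in here for Redis (an assumption for testability); in production you would use a Redis client with a TTL so entries expire when a new model version deploys.

```python
from hashlib import sha256

_cache: dict[str, str] = {}  # Redis stand-in; use SETEX with a TTL in production

def cached_infer(prompt: str, infer) -> str:
    """Return a cached completion for an identical prompt, else call the model."""
    key = sha256(prompt.encode()).hexdigest()
    if key in _cache:
        return _cache[key]  # zero model compute for repeated inputs
    result = infer(prompt)
    _cache[key] = result
    return result
```

This only pays off for tasks with genuinely repeated inputs (classification, extraction); free-text generation rarely hits the cache.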

6. Cost ballpark

Indicative monthly USD cost. Inference GPU compute dominates; each fine-tuning run is a one-off cost incurred only when you train or retrain.

| Scale | Inference requests / month | Monthly cost | Cost drivers |
|---|---|---|---|
| Small | < 100,000 | $500–$2,000 | 1× A10G GPU instance (~$750/month on AWS), one-time training run cost ($50–$200) |
| Medium | 100k–10M | $1,500–$8,000 | 2–4 GPU instances for redundancy, storage, MLflow/W&B, occasional retraining |
| Large | 10M+ | $5,000–$30,000 | GPU fleet, autoscaling, full MLOps tooling, dedicated ML engineering time for model maintenance |
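The break-even behind this table is simple arithmetic: a fixed-cost GPU beats per-request API pricing once monthly volume exceeds the ratio of the two. Both numbers below are illustrative placeholders (the $750/month A10G from the Small row, and an assumed blended per-request API cost); substitute your provider's current prices.

```python
GPU_MONTHLY_USD = 750.0        # 1x A10G, as in the "Small" row
API_COST_PER_REQUEST = 0.002   # assumed blended per-request API cost

def breakeven_requests(gpu_monthly: float, api_per_request: float) -> int:
    """Monthly request volume above which the fixed-cost GPU is cheaper."""
    return round(gpu_monthly / api_per_request)
```

Under these assumptions the crossover sits at a few hundred thousand requests per month, which is why the pattern's context section insists on high volume.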

7. LLM-assisted development fit

| Aspect | Rating | Notes |
|---|---|---|
| Training script scaffolding (TRL + PEFT) | ★★★★ | Good — QLoRA training scripts are well-represented; verify LoRA rank, alpha, and target modules against your specific base model. |
| Dataset formatting and tokenisation | ★★★★ | Generates correct chat-template formatting for Llama/Mistral; verify the EOS token handling, which varies by model family. |
| vLLM serving configuration | ★★★ | Produces working vLLM launch commands and an OpenAI-compatible API wrapper; GPU memory fraction and quantisation settings need tuning on actual hardware. |
| Evaluation harness | ★★★ | Scaffolds pytest-based eval runners correctly; defining the right metrics for your task requires domain expertise, not code generation. |
| Architecture decisions | Don't outsource | Base model selection and fine-tuning method in particular have compounding consequences for quality, cost, and maintenance. |
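The EOS-handling caveat in the table is easiest to see by hand: the same logical example serialises differently per model family, and a missing turn terminator silently teaches the model never to stop. The sketch below follows the Llama 3 special-token convention as an assumed example; in real code always prefer `tokenizer.apply_chat_template(...)` over hand-rolled strings.

```python
def to_llama3(user_msg: str, assistant_msg: str) -> str:
    """Serialise one turn pair using Llama 3-style special tokens (illustrative)."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_msg}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{assistant_msg}<|eot_id|>"  # turn terminator; omit it and the model never learns to stop
    )

sample = to_llama3("Classify: printer offline", "hardware/printer")
```

Mistral, Gemma, and Qwen each use different delimiters, which is exactly why LLM-generated formatting code must be checked against the model family you actually train.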

Recommended workflow: Before any training, establish a baseline by measuring a general model (Claude Haiku, GPT-4o-mini) on your eval set with a well-engineered prompt. If the gap between the baseline and your quality target is < 10%, solve it with prompt engineering first. Only fine-tune when the gap is large and the task volume justifies the operational overhead.
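The workflow above reduces to a small decision function. The 10% gap threshold comes from the text; the minimum-volume cutoff is an assumption added for illustration and should be set from your own cost break-even.

```python
def should_finetune(baseline_score: float, target_score: float,
                    monthly_requests: int, min_volume: int = 100_000) -> str:
    """Decide between prompt engineering and fine-tuning per the workflow above."""
    gap = target_score - baseline_score
    if gap < 0.10:
        return "prompt-engineer"  # small gap: close it with prompting first
    if monthly_requests < min_volume:
        return "prompt-engineer"  # volume too low to justify the ops overhead
    return "fine-tune"
```

Re-run the baseline whenever a new frontier model ships; a gap that justified fine-tuning last quarter may have closed for free.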

8. Reference implementations

  • Public reference: huggingface/trl — Transformer Reinforcement Learning, the primary fine-tuning library; examples/ contains SFT, DPO, and QLoRA training scripts for Llama, Mistral, and Gemma
  • Public reference: huggingface/peft — Parameter-Efficient Fine-Tuning library implementing LoRA, QLoRA, and adapter methods; the foundation all QLoRA training builds on
  • Public reference: vllm-project/vllm — high-throughput inference server with continuous batching; examples/ covers OpenAI-compatible API, multi-GPU tensor parallelism, and quantisation
  • Internal case study: Add your anonymised internal example here
  • No ADRs recorded yet. Candidates: base model selection (Llama vs Mistral vs Phi), training infrastructure (SageMaker vs Modal vs RunPod), evaluation framework.

9. Known risks & gotchas

  • Training data quality determines model quality — at 7B scale there is no safety net. A general 70B model tolerates noisy training signals through scale; a 7B model does not, and garbage in produces a confidently wrong model. Mitigation: invest in data quality before investing in training compute; human-verify a random sample of every training batch before the first training run.
  • Catastrophic forgetting degrades general capability — Fine-tuning on a narrow task removes the model's ability to handle anything outside that distribution. Mitigation: include a small proportion of general instruction-following examples in the training mix (5–10%); measure performance on a general benchmark (MMLU subset) alongside your task-specific eval.
  • Distribution shift silently degrades production accuracy — Production inputs drift from training distribution over months; accuracy drops gradually below the detection threshold. Mitigation: run the golden-set eval on a weekly automated schedule and alert on any 5% accuracy regression; do not rely on user complaints as your primary signal.
  • GPU infrastructure is high-ops — Hardware failures, CUDA version conflicts, out-of-memory errors during training, and driver incompatibilities are routine. Mitigation: use managed training (SageMaker) rather than self-managed GPU instances; containerise training scripts with pinned CUDA/torch versions; document the exact environment that produced the current production model.
  • Model rollback must be under 5 minutes — A bad model deployed to production needs to revert faster than a bad code deployment. Mitigation: keep the previous model version warm in the registry and scripted for fast swap; test the rollback procedure before it is needed, not during an incident.
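The drift mitigation above is a one-comparison check once the weekly canary score exists: flag a retrain when accuracy falls more than 5 percentage points below the score recorded at deployment. Function and variable names are illustrative.

```python
DRIFT_THRESHOLD = 0.05  # absolute accuracy drop that triggers retraining, per the text

def needs_retrain(deploy_accuracy: float, current_accuracy: float) -> bool:
    """True when this week's golden-set score has drifted past the threshold."""
    return (deploy_accuracy - current_accuracy) > DRIFT_THRESHOLD
```

Record `deploy_accuracy` in the model registry alongside the weights, so the canary always compares against the score the current model actually shipped with.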