Key Takeaway
By the end of this blueprint you will have a repeatable fine-tuning pipeline that curates and validates training data, orchestrates LoRA/QLoRA training jobs, evaluates checkpoints against standardized benchmarks, registers passing models in a versioned registry, and deploys them with canary rollouts and automatic rollback.
Prerequisites
- Python 3.11+ with PyTorch 2.x and the Hugging Face transformers library
- Access to GPU compute (cloud instances with A100/H100 or RunPod/Modal credits)
- A base model to fine-tune (Llama 3, Mistral, or Phi-3)
- Domain-specific data: at least 500 high-quality examples for LoRA fine-tuning
- Familiarity with training concepts: learning rate, epochs, loss curves
- W&B (Weights & Biases) or MLflow for experiment tracking
When to Fine-Tune vs. Prompt Engineer
Fine-tuning is not the default answer. It is expensive, requires curated data, and creates a model you must maintain. Reach for fine-tuning only when prompt engineering has hit its ceiling:
- The model cannot follow your format consistently despite detailed instructions.
- The model lacks domain-specific vocabulary or reasoning patterns.
- You need to cut inference costs by using a smaller model that matches a larger model's quality on your specific task.
- You need to cut latency by fitting the task onto a model that runs on smaller hardware.
Try prompting and few-shot examples first. If you can get 80% of your target quality with prompting, you likely do not need fine-tuning. If you are stuck at 60% despite extensive prompt engineering, fine-tuning can often close the gap. The evaluation framework in this blueprint helps you measure exactly where you stand.
Data Curation Pipeline
Training data quality is the single largest determinant of fine-tuning success. The pipeline follows a four-stage process: collection (gathering raw examples from production logs, expert annotations, or synthetic generation), cleaning (deduplication, PII removal, format normalization), validation (schema checks, quality scoring, label verification), and splitting (train/validation/test with stratification on key attributes). Every dataset gets a version hash so you can reproduce any training run.
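As a minimal sketch of the versioning step, the hash can be derived from the canonical JSON of every example so that any content change (but not a reordering) produces a new dataset version. The helper name and the 16-character truncation are illustrative choices, not part of any standard:

```python
import hashlib
import json


def dataset_version_hash(examples: list[dict]) -> str:
    """Deterministic, order-insensitive version hash for a dataset.

    Each example is hashed from its canonical (sorted-keys) JSON form;
    the per-example digests are sorted before the final hash so that
    shuffling the dataset does not change its version.
    """
    digests = sorted(
        hashlib.sha256(json.dumps(ex, sort_keys=True).encode()).hexdigest()
        for ex in examples
    )
    return hashlib.sha256("".join(digests).encode()).hexdigest()[:16]
```

Record this hash in the training run's metadata so any checkpoint can be traced back to the exact data it was trained on.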
"""Data curation pipeline for fine-tuning datasets."""
from __future__ import annotations
import hashlib
import json
from pathlib import Path
from typing import Iterator
from pydantic import BaseModel, Field, field_validator
class TrainingExample(BaseModel):
"""A single training example in chat format."""
messages: list[dict[str, str]]
metadata: dict = Field(default_factory=dict)
@field_validator("messages")
@classmethod
def validate_messages(cls, v):
roles = [m["role"] for m in v]
if roles[0] != "system":
raise ValueError("First message must have role 'system'")
if roles[-1] != "assistant":
raise ValueError("Last message must have role 'assistant'")
return v
def load_and_validate(path: Path) -> Iterator[TrainingExample]:
"""Load JSONL and validate each example."""
with open(path) as f:
for i, line in enumerate(f):
try:
data = json.loads(line)
yield TrainingExample(**data)
except Exception as e:
print(f"Skipping line {i}: {e}")
def deduplicate(examples: list[TrainingExample]) -> list[TrainingExample]:
"""Remove exact duplicates based on content hash."""
seen = set()
unique = []
for ex in examples:
content_hash = hashlib.sha256(
json.dumps(ex.messages, sort_keys=True).encode()
).hexdigest()
if content_hash not in seen:
seen.add(content_hash)
unique.append(ex)
return unique
def split_dataset(
examples: list[TrainingExample],
train_ratio: float = 0.85,
val_ratio: float = 0.10,
) -> dict[str, list[TrainingExample]]:
"""Split into train/val/test with deterministic shuffle."""
import random
rng = random.Random(42)
shuffled = list(examples)
rng.shuffle(shuffled)
n = len(shuffled)
train_end = int(n * train_ratio)
val_end = train_end + int(n * val_ratio)
return {
"train": shuffled[:train_end],
"validation": shuffled[train_end:val_end],
"test": shuffled[val_end:],
}LoRA and QLoRA Training
Full fine-tuning updates every parameter in the model, requiring massive GPU memory and compute. LoRA (Low-Rank Adaptation) trains only a small set of adapter matrices that are merged with the base model weights at inference time. QLoRA goes further by quantizing the base model to 4-bit precision during training, reducing memory requirements by 4-8x. A 7B parameter model that requires 4x A100 GPUs for full fine-tuning can be LoRA-trained on a single A100 and QLoRA-trained on a single consumer GPU with 24GB VRAM.
"""LoRA/QLoRA fine-tuning with Hugging Face PEFT and TRL."""
from __future__ import annotations
import torch
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
)
from trl import SFTTrainer
def train(
base_model: str = "meta-llama/Llama-3.1-8B-Instruct",
dataset_path: str = "./data/train.jsonl",
output_dir: str = "./checkpoints",
use_qlora: bool = True,
lora_r: int = 16,
lora_alpha: int = 32,
epochs: int = 3,
learning_rate: float = 2e-4,
batch_size: int = 4,
max_seq_length: int = 2048,
):
"""Run LoRA or QLoRA fine-tuning.
Args:
base_model: HuggingFace model ID for the base model.
dataset_path: Path to JSONL training data.
output_dir: Where to save checkpoints.
use_qlora: If True, quantize base model to 4-bit.
lora_r: LoRA rank (higher = more parameters).
lora_alpha: LoRA scaling factor.
"""
# Quantization config for QLoRA
quant_config = None
if use_qlora:
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# Load base model
model = AutoModelForCausalLM.from_pretrained(
base_model,
quantization_config=quant_config,
device_map="auto",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
# LoRA config — target the attention layers
lora_config = LoraConfig(
r=lora_r,
lora_alpha=lora_alpha,
lora_dropout=0.05,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
task_type=TaskType.CAUSAL_LM,
bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # Typically 0.5-2% of total
# Load dataset
dataset = load_dataset("json", data_files=dataset_path, split="train")
# Training arguments
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
gradient_accumulation_steps=4,
learning_rate=learning_rate,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
bf16=True,
logging_steps=10,
save_strategy="epoch",
evaluation_strategy="epoch",
report_to="wandb",
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
max_seq_length=max_seq_length,
)
trainer.train()
trainer.save_model(f"{output_dir}/final")Training Approach Comparison
| Approach | GPU Memory | Training Time | Quality | Cost |
|---|---|---|---|---|
| Full fine-tune (7B) | 4x A100 80GB | 4-8 hours | Highest | $50-200 |
| LoRA (7B, r=16) | 1x A100 40GB | 1-3 hours | Near-full | $10-40 |
| QLoRA (7B, r=16) | 1x GPU 24GB | 2-4 hours | Very good | $5-20 |
| OpenAI fine-tune API | Managed | 1-4 hours | Good | $15-50 |
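The memory column can be sanity-checked with a rough rule of thumb. The byte-per-parameter constants below are common approximations (bf16 weights and gradients, fp32 AdamW state), not measurements, and activations, KV cache, and batch size add further overhead on top:

```python
def estimate_train_memory_gb(
    n_params_b: float,
    mode: str = "qlora",
    lora_params_frac: float = 0.01,
) -> float:
    """Rough GPU-memory rule of thumb for weights + gradients + AdamW state.

    Constants are approximations: activations and KV cache are excluded,
    and lora_params_frac (adapter params as a fraction of the base) is an
    assumed typical value, not derived from any specific config.
    """
    p = n_params_b * 1e9
    if mode == "full":
        # bf16 weights (2B) + bf16 grads (2B) + fp32 AdamW moments (8B)
        # + fp32 master weights (4B) = ~16 bytes per parameter
        bytes_total = p * 16
    elif mode == "lora":
        # frozen bf16 base (2B/param) + full training state for adapters only
        bytes_total = p * 2 + p * lora_params_frac * 16
    else:  # qlora
        # 4-bit base (~0.5B/param) + adapter training state
        bytes_total = p * 0.5 + p * lora_params_frac * 16
    return bytes_total / 1e9
```

For a 7B model this yields roughly 112 GB for full fine-tuning, ~15 GB for LoRA, and ~5 GB for QLoRA before activation overhead, which is consistent with the hardware tiers in the table.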
Evaluation and Benchmarking
Every checkpoint must pass an evaluation benchmark before it is eligible for production. The benchmark suite should include: task-specific accuracy tests using your held-out test set, format compliance checks (does the model follow your output schema?), safety evaluations (does the model refuse harmful requests?), and regression tests against the base model to ensure fine-tuning did not degrade general capabilities. Automate this as a pipeline stage that runs after every training job completes.
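A minimal evaluation gate along these lines might look as follows. The thresholds, the JSON output schema, and the `generate` callable are all assumptions to be replaced with your own inference client and task definition:

```python
import json
from typing import Callable


def format_compliant(output: str) -> bool:
    """Example format check: output must be valid JSON with an 'answer' key."""
    try:
        return "answer" in json.loads(output)
    except json.JSONDecodeError:
        return False


def evaluate_checkpoint(
    generate: Callable[[str], str],
    test_set: list[dict],
    min_accuracy: float = 0.85,
    min_format_rate: float = 0.98,
) -> dict:
    """Score a checkpoint on the held-out test set and gate on thresholds."""
    correct = 0
    well_formed = 0
    for ex in test_set:
        out = generate(ex["prompt"])
        if format_compliant(out):
            well_formed += 1
            if json.loads(out)["answer"] == ex["expected"]:
                correct += 1
    n = len(test_set)
    scores = {"accuracy": correct / n, "format_rate": well_formed / n}
    scores["passed"] = (
        scores["accuracy"] >= min_accuracy
        and scores["format_rate"] >= min_format_rate
    )
    return scores
```

In the full pipeline this gate runs alongside the safety and general-capability benchmarks; a checkpoint must pass all of them to be registered.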
Model Registry and Deployment
Passing checkpoints are registered in a model registry with metadata: base model, dataset version, training hyperparameters, evaluation scores, and the Git commit that produced the training config. The deployment controller pulls from the registry, deploys to a canary endpoint serving a small percentage of traffic, monitors quality metrics, and promotes to full traffic only when the canary shows no regressions. If quality drops, the controller automatically rolls back to the previous registered model.
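A registry entry can be as simple as a serializable record. The field names below are illustrative; managed registries (MLflow, W&B Artifacts) provide equivalents of each:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelRegistryEntry:
    """One registered checkpoint with full lineage metadata."""

    model_id: str                 # e.g. a task-specific model name
    version: str                  # version of this fine-tune
    base_model: str               # HF ID of the base checkpoint
    dataset_hash: str             # version hash of the training dataset
    training_config_commit: str   # Git commit that produced the config
    eval_scores: dict = field(default_factory=dict)
    status: str = "registered"    # registered -> canary -> production

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

The deployment controller only ever reads from records like this, so every model in production can be traced back to its data, config, and scores.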
1. Training completes: checkpoint saved to artifact storage with training metadata.
2. Evaluation runs: automated benchmark suite scores the checkpoint on all dimensions.
3. Registry entry created: passing checkpoints are registered with scores and lineage metadata.
4. Canary deployment: model deployed to serve 5% of production traffic.
5. Quality monitoring: automated evaluator scores canary responses for 24 hours.
6. Promotion or rollback: promote to 100% if metrics hold, roll back if they degrade.
Never skip the evaluation stage, even for small LoRA adapters. Fine-tuning can cause catastrophic forgetting where the model loses capabilities outside your training distribution. Always include general-capability benchmarks alongside your task-specific tests to catch this.
Version History
1.0.0 · 2026-03-01
- Initial publication with LoRA/QLoRA training pipeline using PEFT and TRL
- Data curation pipeline with validation and deduplication
- Training approach comparison table
- Evaluation benchmarking and model registry patterns
- Canary deployment with automatic rollback