Key Takeaway
By the end of this blueprint you will have a repeatable fine-tuning pipeline that curates and validates training data, orchestrates LoRA/QLoRA training jobs, evaluates checkpoints against standardized benchmarks, registers passing models in a versioned registry, and deploys them with canary rollouts and automatic rollback.
Prerequisites
- Python 3.11+ with PyTorch 2.x and the Hugging Face transformers library
- Access to GPU compute (cloud instances with A100/H100 or RunPod/Modal credits)
- A base model to fine-tune (Llama 3, Mistral, or Phi-3)
- Domain-specific data: at least 500 high-quality examples for LoRA fine-tuning
- Familiarity with training concepts: learning rate, epochs, loss curves
- W&B (Weights & Biases) or MLflow for experiment tracking
When to Fine-Tune vs. Prompt Engineer
Fine-tuning is not the default answer. It is expensive, requires curated data, and creates a model you must maintain. Reach for fine-tuning only when prompt engineering has hit its ceiling:
- The model cannot follow your format consistently despite detailed instructions.
- The model lacks domain-specific vocabulary or reasoning patterns.
- You need to cut inference costs by using a smaller model that matches a larger model's quality on your specific task.
- You need to cut latency by fitting the task onto a model that runs on smaller hardware.
Try prompting and few-shot examples first. If you can get 80% of your target quality with prompting, you likely do not need fine-tuning. If you are stuck at 60% despite extensive prompt engineering, fine-tuning can often close the gap. The evaluation framework in this blueprint helps you measure exactly where you stand.
Data Curation Pipeline
Training data quality is the single largest determinant of fine-tuning success. The pipeline follows a four-stage process: collection (gathering raw examples from production logs, expert annotations, or synthetic generation), cleaning (deduplication, PII removal, format normalization), validation (schema checks, quality scoring, label verification), and splitting (train/validation/test with stratification on key attributes). Every dataset gets a version hash so you can reproduce any training run.
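As a minimal sketch of the versioning step, the hash can be derived from the canonical JSON of every example so that any content change (but not a reordering) produces a new dataset version. The helper name and the 16-character truncation are illustrative choices, not part of any standard:

```python
import hashlib
import json


def dataset_version_hash(examples: list[dict]) -> str:
    """Deterministic, order-insensitive version hash for a dataset.

    Each example is hashed from its canonical (sorted-keys) JSON form;
    the per-example digests are sorted before the final hash so that
    shuffling the dataset does not change its version.
    """
    digests = sorted(
        hashlib.sha256(json.dumps(ex, sort_keys=True).encode()).hexdigest()
        for ex in examples
    )
    return hashlib.sha256("".join(digests).encode()).hexdigest()[:16]
```

Record this hash in the training run's metadata so any checkpoint can be traced back to the exact data it was trained on.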
"""Data curation pipeline for fine-tuning datasets."""
from __future__ import annotations
import hashlib
import json
from pathlib import Path
from typing import Iterator
from pydantic import BaseModel, Field, field_validator
class TrainingExample(BaseModel):
"""A single training example in chat format."""
messages: list[dict[str, str]]
metadata: dict = Field(default_factory=dict)
@field_validator("messages")
@classmethod
def validate_messages(cls, v):
roles = [m["role"] for m in v]
if roles[0] != "system":
raise ValueError("First message must have role 'system'")
if roles[-1] != "assistant":
raise ValueError("Last message must have role 'assistant'")
return v
def load_and_validate(path: Path) -> Iterator[TrainingExample]:
"""Load JSONL and validate each example."""
with open(path) as f:
for i, line in enumerate(f):
try:
data = json.loads(line)
yield TrainingExample(**data)
except Exception as e:
print(f"Skipping line {i}: {e}")
def deduplicate(examples: list[TrainingExample]) -> list[TrainingExample]:
"""Remove exact duplicates based on content hash."""
seen = set()
unique = []
for ex in examples:
content_hash = hashlib.sha256(
json.dumps(ex.messages, sort_keys=True).encode()
).hexdigest()
if content_hash not in seen:
seen.add(content_hash)
unique.append(ex)
return unique
def split_dataset(
examples: list[TrainingExample],
train_ratio: float = 0.85,
val_ratio: float = 0.10,
) -> dict[str, list[TrainingExample]]:
"""Split into train/val/test with deterministic shuffle."""
import random
rng = random.Random(42)
shuffled = list(examples)
rng.shuffle(shuffled)
n = len(shuffled)
train_end = int(n * train_ratio)
val_end = train_end + int(n * val_ratio)
return {
"train": shuffled[:train_end],
"validation": shuffled[train_end:val_end],
"test": shuffled[val_end:],
}LoRA and QLoRA Training
Full fine-tuning updates every parameter in the model, requiring massive GPU memory and compute. LoRA (Low-Rank Adaptation) trains only a small set of adapter matrices that are merged with the base model weights at inference time. QLoRA goes further by quantizing the base model to 4-bit precision during training, reducing memory requirements by 4-8x. A 7B parameter model that requires 4x A100 GPUs for full fine-tuning can be LoRA-trained on a single A100 and QLoRA-trained on a single consumer GPU with 24GB VRAM.
"""LoRA/QLoRA fine-tuning with Hugging Face PEFT and TRL."""
from __future__ import annotations
import torch
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
)
from trl import SFTTrainer
def train(
base_model: str = "meta-llama/Llama-3.1-8B-Instruct",
dataset_path: str = "./data/train.jsonl",
output_dir: str = "./checkpoints",
use_qlora: bool = True,
lora_r: int = 16,
lora_alpha: int = 32,
epochs: int = 3,
learning_rate: float = 2e-4,
batch_size: int = 4,
max_seq_length: int = 2048,
):
"""Run LoRA or QLoRA fine-tuning.
Args:
base_model: HuggingFace model ID for the base model.
dataset_path: Path to JSONL training data.
output_dir: Where to save checkpoints.
use_qlora: If True, quantize base model to 4-bit.
lora_r: LoRA rank (higher = more parameters).
lora_alpha: LoRA scaling factor.
"""
# Quantization config for QLoRA
quant_config = None
if use_qlora:
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# Load base model
model = AutoModelForCausalLM.from_pretrained(
base_model,
quantization_config=quant_config,
device_map="auto",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
# LoRA config — target the attention layers
lora_config = LoraConfig(
r=lora_r,
lora_alpha=lora_alpha,
lora_dropout=0.05,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
task_type=TaskType.CAUSAL_LM,
bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # Typically 0.5-2% of total
# Load dataset
dataset = load_dataset("json", data_files=dataset_path, split="train")
# Training arguments
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
gradient_accumulation_steps=4,
learning_rate=learning_rate,
lr_scheduler_type="cosine",
warmup_ratio=0.1,
bf16=True,
logging_steps=10,
save_strategy="epoch",
evaluation_strategy="epoch",
report_to="wandb",
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
max_seq_length=max_seq_length,
)
trainer.train()
trainer.save_model(f"{output_dir}/final")Training Approach Comparison
| Approach | GPU Memory | Training Time | Quality | Cost |
|---|---|---|---|---|
| Full fine-tune (7B) | 4x A100 80GB | 4-8 hours | Highest | $50-200 |
| LoRA (7B, r=16) | 1x A100 40GB | 1-3 hours | Near-full | $10-40 |
| QLoRA (7B, r=16) | 1x GPU 24GB | 2-4 hours | Very good | $5-20 |
| OpenAI fine-tune API | Managed | 1-4 hours | Good | $15-50 |
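The memory column can be sanity-checked with a rough rule of thumb. The byte-per-parameter constants below are common approximations (bf16 weights and gradients, fp32 AdamW state), not measurements, and activations, KV cache, and batch size add further overhead on top:

```python
def estimate_train_memory_gb(
    n_params_b: float,
    mode: str = "qlora",
    lora_params_frac: float = 0.01,
) -> float:
    """Rough GPU-memory rule of thumb for weights + gradients + AdamW state.

    Constants are approximations: activations and KV cache are excluded,
    and lora_params_frac (adapter params as a fraction of the base) is an
    assumed typical value, not derived from any specific config.
    """
    p = n_params_b * 1e9
    if mode == "full":
        # bf16 weights (2B) + bf16 grads (2B) + fp32 AdamW moments (8B)
        # + fp32 master weights (4B) = ~16 bytes per parameter
        bytes_total = p * 16
    elif mode == "lora":
        # frozen bf16 base (2B/param) + full training state for adapters only
        bytes_total = p * 2 + p * lora_params_frac * 16
    else:  # qlora
        # 4-bit base (~0.5B/param) + adapter training state
        bytes_total = p * 0.5 + p * lora_params_frac * 16
    return bytes_total / 1e9
```

For a 7B model this yields roughly 112 GB for full fine-tuning, ~15 GB for LoRA, and ~5 GB for QLoRA before activation overhead, which is consistent with the hardware tiers in the table.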
Evaluation and Benchmarking
Every checkpoint must pass an evaluation benchmark before it is eligible for production. The benchmark suite should include: task-specific accuracy tests using your held-out test set, format compliance checks (does the model follow your output schema?), safety evaluations (does the model refuse harmful requests?), and regression tests against the base model to ensure fine-tuning did not degrade general capabilities. Automate this as a pipeline stage that runs after every training job completes.
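A minimal evaluation gate along these lines might look as follows. The thresholds, the JSON output schema, and the `generate` callable are all assumptions to be replaced with your own inference client and task definition:

```python
import json
from typing import Callable


def format_compliant(output: str) -> bool:
    """Example format check: output must be valid JSON with an 'answer' key."""
    try:
        return "answer" in json.loads(output)
    except json.JSONDecodeError:
        return False


def evaluate_checkpoint(
    generate: Callable[[str], str],
    test_set: list[dict],
    min_accuracy: float = 0.85,
    min_format_rate: float = 0.98,
) -> dict:
    """Score a checkpoint on the held-out test set and gate on thresholds."""
    correct = 0
    well_formed = 0
    for ex in test_set:
        out = generate(ex["prompt"])
        if format_compliant(out):
            well_formed += 1
            if json.loads(out)["answer"] == ex["expected"]:
                correct += 1
    n = len(test_set)
    scores = {"accuracy": correct / n, "format_rate": well_formed / n}
    scores["passed"] = (
        scores["accuracy"] >= min_accuracy
        and scores["format_rate"] >= min_format_rate
    )
    return scores
```

In the full pipeline this gate runs alongside the safety and general-capability benchmarks; a checkpoint must pass all of them to be registered.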
Model Registry and Deployment
Passing checkpoints are registered in a model registry with metadata: base model, dataset version, training hyperparameters, evaluation scores, and the Git commit that produced the training config. The deployment controller pulls from the registry, deploys to a canary endpoint serving a small percentage of traffic, monitors quality metrics, and promotes to full traffic only when the canary shows no regressions. If quality drops, the controller automatically rolls back to the previous registered model.
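A registry entry can be as simple as a serializable record. The field names below are illustrative; managed registries (MLflow, W&B Artifacts) provide equivalents of each:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelRegistryEntry:
    """One registered checkpoint with full lineage metadata."""

    model_id: str                 # e.g. a task-specific model name
    version: str                  # version of this fine-tune
    base_model: str               # HF ID of the base checkpoint
    dataset_hash: str             # version hash of the training dataset
    training_config_commit: str   # Git commit that produced the config
    eval_scores: dict = field(default_factory=dict)
    status: str = "registered"    # registered -> canary -> production

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

The deployment controller only ever reads from records like this, so every model in production can be traced back to its data, config, and scores.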
1. Training completes: checkpoint saved to artifact storage with training metadata.
2. Evaluation runs: automated benchmark suite scores the checkpoint on all dimensions.
3. Registry entry created: passing checkpoints are registered with scores and lineage metadata.
4. Canary deployment: model deployed to serve 5% of production traffic.
5. Quality monitoring: automated evaluator scores canary responses for 24 hours.
6. Promotion or rollback: promote to 100% if metrics hold, roll back if they degrade.
Never skip the evaluation stage, even for small LoRA adapters. Fine-tuning can cause catastrophic forgetting where the model loses capabilities outside your training distribution. Always include general-capability benchmarks alongside your task-specific tests to catch this.
Version History
1.0.0 · 2026-03-01
- Initial publication with LoRA/QLoRA training pipeline using PEFT and TRL
- Data curation pipeline with validation and deduplication
- Training approach comparison table
- Evaluation benchmarking and model registry patterns
- Canary deployment with automatic rollback