Key Takeaway
Treating prompts as versioned, tested, and reviewed artifacts rather than inline strings eliminates a major source of production regressions in LLM-powered features. This guide covers the full lifecycle of production prompt management: version control, automated evaluation, A/B testing, CI/CD integration, and observability.
Prerequisites
- At least one LLM-powered feature in production or nearing deployment
- Version control system (Git) and CI/CD pipeline for your application
- A set of representative test cases for evaluating prompt quality
- Understanding of your LLM provider's API (token limits, model options, caching)
- Logging infrastructure for tracking prompt performance metrics
Prompts Are Production Code
The fundamental shift in production prompt engineering is treating prompts as code, not as text. Prompts deserve version control, code review, automated testing, staged rollouts, and rollback capability -- exactly the same rigor applied to application code. When a prompt is an inline string in a source file, it changes without review, breaks without tests catching it, and cannot be rolled back independently of a code deployment. When a prompt is a versioned artifact with an evaluation pipeline, it gains all the production safety guarantees that software engineering has developed over decades.
Prompt Repository Structure
Organize prompts in a dedicated directory with a clear naming convention and metadata file for each prompt. Each prompt should have a unique identifier, a semantic version, a description of its purpose, the model and parameters it was tested with, and a reference to its evaluation dataset. This structure enables tooling to validate prompts, track changes, and associate evaluation results with specific prompt versions.
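As an illustrative sketch (the file layout and field names here are assumptions, not a standard), per-prompt metadata stored as JSON alongside each template can be validated before it enters the registry:

```python
import json

# Fields every prompt metadata file is expected to carry.
# These names are illustrative, mirroring the registry schema below.
REQUIRED_FIELDS = {
    "id", "name", "version", "template",
    "model", "max_tokens", "temperature", "status",
}

def validate_prompt_metadata(raw: str) -> dict:
    """Parse a prompt metadata JSON document and check required fields."""
    meta = json.loads(raw)
    missing = REQUIRED_FIELDS - meta.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return meta

# A hypothetical prompts/product-search/3.2.1.json
example = """{
  "id": "product-search-v3.2.1",
  "name": "product-search",
  "version": "3.2.1",
  "template": "Find products matching: {{query}}",
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1024,
  "temperature": 0.2,
  "status": "draft"
}"""
```

Running such a check in CI means a malformed metadata file fails the build rather than surfacing at request time.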
/**
 * Production prompt registry.
 *
 * Manages versioned prompts as first-class artifacts
 * with evaluation metadata and rollback support.
 */
interface PromptVersion {
  id: string; // e.g., "product-search-v3.2.1"
  name: string; // Human-readable name
  version: string; // Semantic version
  template: string; // The prompt template with {{variables}}
  model: string; // Target model (e.g., "claude-sonnet-4-20250514")
  maxTokens: number;
  temperature: number;
  evaluationScore: number; // Score from automated evaluation
  evaluationDate: string;
  status: "draft" | "testing" | "canary" | "production" | "deprecated";
  changelog: string;
}

// Numeric semver comparison ("3.10.0" > "3.9.0"); plain string
// comparison would order these incorrectly.
function compareSemver(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  return 0;
}

class PromptRegistry {
  private prompts: Map<string, PromptVersion[]> = new Map();

  register(prompt: PromptVersion): void {
    const versions = this.prompts.get(prompt.name) || [];
    versions.push(prompt);
    this.prompts.set(prompt.name, versions);
  }

  getProduction(name: string): PromptVersion | undefined {
    const versions = this.prompts.get(name) || [];
    return versions.find((p) => p.status === "production");
  }

  getCanary(name: string): PromptVersion | undefined {
    const versions = this.prompts.get(name) || [];
    return versions.find((p) => p.status === "canary");
  }

  promote(
    name: string, version: string, to: PromptVersion["status"],
  ): void {
    const versions = this.prompts.get(name) || [];
    const target = versions.find((p) => p.version === version);
    if (!target) throw new Error(`Version ${version} not found`);
    // If promoting to production, demote current production
    if (to === "production") {
      const current = this.getProduction(name);
      if (current) current.status = "deprecated";
    }
    target.status = to;
  }

  rollback(name: string): PromptVersion | undefined {
    const versions = this.prompts.get(name) || [];
    // Find the most recent deprecated version (numeric semver order)
    const previous = versions
      .filter((p) => p.status === "deprecated")
      .sort((a, b) => compareSemver(b.version, a.version))[0];
    if (previous) {
      this.promote(name, previous.version, "production");
    }
    return previous;
  }
}
Automated Evaluation
Every prompt change must pass an automated evaluation before reaching production. The evaluation pipeline runs the updated prompt against a golden dataset -- a curated set of inputs with expected outputs or quality criteria -- and measures quality metrics. These metrics vary by use case: for classification prompts, measure accuracy and F1 score; for generation prompts, measure coherence, factual accuracy, and instruction following; for extraction prompts, measure precision and recall against expected fields.
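Before the pipeline itself, it helps to see what a scoring function might look like. These two are illustrative sketches of the `scoring_fn` contract assumed below (output and expected string in, score between 0.0 and 1.0 out); the normalization rules are assumptions to tune for your label and field formats:

```python
def exact_match_score(output: str, expected: str) -> float:
    """Score a classification output: 1.0 on a normalized exact match.

    Normalization (lowercase, strip whitespace and a trailing period)
    is an illustrative choice, not a standard.
    """
    def norm(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return 1.0 if norm(output) == norm(expected) else 0.0

def token_f1_score(output: str, expected: str) -> float:
    """Token-overlap F1, a rough proxy for extraction-style prompts."""
    out_tokens = set(output.lower().split())
    exp_tokens = set(expected.lower().split())
    if not out_tokens or not exp_tokens:
        return 0.0
    overlap = len(out_tokens & exp_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)
```

For generation prompts, the scoring function is often another LLM call (an LLM-as-judge rubric) rather than string comparison; the contract stays the same.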
"""Automated prompt evaluation pipeline.
Runs a prompt variant against a golden dataset
and produces a quality report that gates deployment.
"""
from dataclasses import dataclass
from typing import List, Dict, Callable, Any
import json
@dataclass
class EvalCase:
"""A single evaluation test case."""
input_vars: Dict[str, str]
expected_output: str # Or criteria for judgment
category: str # For slice-based analysis
@dataclass
class EvalResult:
"""Result of evaluating one test case."""
case: EvalCase
actual_output: str
score: float # 0.0 - 1.0
latency_ms: float
token_count: int
notes: str
def evaluate_prompt(
prompt_template: str,
eval_cases: List[EvalCase],
llm_call_fn: Callable,
scoring_fn: Callable,
model: str = "claude-sonnet-4-20250514",
) -> Dict[str, Any]:
"""Evaluate a prompt against a golden dataset.
Args:
prompt_template: The prompt with {{variable}} placeholders
eval_cases: Test cases with inputs and expected outputs
llm_call_fn: Function to call the LLM
scoring_fn: Function to score output vs expected
model: Model to evaluate against
Returns:
Evaluation report with overall and per-category scores
"""
results: List[EvalResult] = []
for case in eval_cases:
# Render the prompt
prompt = prompt_template
for key, value in case.input_vars.items():
prompt = prompt.replace(f"{{{{{key}}}}}", value)
# Call the LLM
import time
start = time.time()
output = llm_call_fn(prompt, model=model)
latency = (time.time() - start) * 1000
# Score the output
score = scoring_fn(output, case.expected_output)
results.append(EvalResult(
case=case,
actual_output=output,
score=score,
latency_ms=latency,
token_count=len(output.split()), # Rough estimate
notes="",
))
# Aggregate results
overall_score = sum(r.score for r in results) / len(results)
by_category: Dict[str, float] = {}
for cat in set(r.case.category for r in results):
cat_results = [r for r in results if r.case.category == cat]
by_category[cat] = sum(
r.score for r in cat_results
) / len(cat_results)
return {
"overall_score": round(overall_score, 4),
"by_category": by_category,
"total_cases": len(results),
"pass_rate": sum(
1 for r in results if r.score >= 0.8
) / len(results),
"avg_latency_ms": sum(
r.latency_ms for r in results
) / len(results),
"passes_threshold": overall_score >= 0.85,
}CI/CD Integration
Prompt changes should flow through the same CI/CD pipeline as code changes. When a prompt file is modified in a pull request, the CI pipeline runs the evaluation suite against the updated prompt, compares the results against the current production prompt's scores, and blocks the merge if quality regresses. This catches regressions before they reach production and creates a documented history of prompt quality over time.
Run evaluations against two models: the current production model and the next model you plan to migrate to. This builds a library of evaluation data that accelerates model migrations. When you switch models, you already know how your prompts perform on the new model and which ones need adjustment.
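A minimal CI gate, assuming evaluation reports are serialized as JSON with the fields produced by the pipeline above (the regression margin of 0.01 is an illustrative default, not a recommendation), might look like this:

```python
def gate(candidate_report: dict, baseline_report: dict,
         max_regression: float = 0.01) -> bool:
    """Return True if the candidate prompt may merge.

    Blocks the merge when the candidate's overall score falls more than
    max_regression below the current production prompt's score, or when
    any per-category score regresses beyond the same margin. The
    per-category check catches changes that improve the average while
    quietly degrading one slice of traffic.
    """
    floor = baseline_report["overall_score"] - max_regression
    if candidate_report["overall_score"] < floor:
        return False
    for cat, base_score in baseline_report["by_category"].items():
        cand_score = candidate_report["by_category"].get(cat, 0.0)
        if cand_score < base_score - max_regression:
            return False
    return True
```

In the pipeline itself, a wrapper script would load the two report files and exit nonzero when `gate` returns False, which is what actually blocks the merge.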
A/B Testing Prompts
For prompts where offline evaluation cannot fully predict production quality (most generation and conversation prompts fall into this category), A/B testing on live traffic provides the definitive answer. Route a percentage of traffic to the new prompt variant and compare quality metrics, user engagement, and business outcomes against the control. Use the canary pattern: start at 5% traffic, evaluate for 24-48 hours, and expand to 50% if metrics are positive before promoting to 100%.
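The traffic split itself can be done with a deterministic hash on a stable unit such as the user ID, so each user consistently sees the same variant and the rollout percentage is the only knob. A minimal sketch (the experiment-name salt and bucket scheme are illustrative choices):

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   canary_percent: float) -> str:
    """Deterministically bucket a user into 'canary' or 'control'.

    Hashing user_id together with the experiment name keeps bucket
    assignments independent across experiments, and stable as
    canary_percent grows: users in the 5% cohort remain in at 50%.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return "canary" if bucket < canary_percent / 100 else "control"
```

Because assignment is a pure function of the inputs, no assignment store is needed and any service in the stack can reproduce the same split.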
Step 1: Define Success Metrics
Before running the test, define which metrics determine success: output quality scores, user satisfaction proxies (click-through, completion rate, thumbs up/down), token efficiency, and latency. Set a minimum improvement threshold.
Step 2: Canary Deployment (5% traffic)
Route 5% of traffic to the new prompt. Monitor all metrics for regressions over 24-48 hours. If any metric degrades significantly, abort the test and roll back.
Step 3: Expanded Test (50% traffic)
If canary metrics are positive, expand to 50%. Run for long enough to reach statistical significance on your primary metric, typically 3-7 days depending on traffic volume.
Step 4: Promotion or Rollback
If the variant wins on the primary metric without regressing secondary metrics, promote to 100%. If results are inconclusive or negative, roll back to the control and iterate on the prompt.
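For rate metrics such as thumbs-up rate or completion rate, the significance check in Step 3 can be a standard two-proportion z-test. A minimal sketch using only the standard library (significance thresholds and metric choice are up to you):

```python
import math

def two_proportion_z(success_a: int, total_a: int,
                     success_b: int, total_b: int) -> tuple:
    """Two-sided two-proportion z-test.

    Returns (z, p_value) for the null hypothesis that the variant and
    the control share the same underlying success rate.
    """
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal CDF, computed via erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

If you peek at results repeatedly during the test, a fixed p < 0.05 cutoff inflates false positives; either fix the test duration in advance or use a sequential testing correction.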
Version History
1.0.0 · 2026-03-01
- Initial release with production prompt management lifecycle
- Prompt registry implementation in TypeScript
- Automated evaluation pipeline in Python
- A/B testing process with four-step canary deployment
- CI/CD integration guidance and readiness checklist