Key Takeaway
By the end of this blueprint you will have an AI observability stack that captures distributed traces across LLM calls and tool invocations using OpenTelemetry, feeds cost attribution dashboards in Grafana, runs automated quality scoring with LLM-as-judge evaluators, and alerts on regressions before users notice.
Prerequisites
- An LLM application in production (or staging) generating real traffic
- Docker Compose for running the collector, Prometheus, and Grafana locally
- Python 3.11+ with the OpenTelemetry SDK installed
- Familiarity with distributed tracing concepts (traces, spans, attributes)
- Optional: a Langfuse or LangSmith account for managed LLM tracing
Why Traditional APM Falls Short
Traditional APM tools track request latency, error rates, and throughput. These are necessary but insufficient for AI applications. An LLM call can return HTTP 200 with a perfectly structured response that is factually wrong, off-brand, or unsafe. You need three additional metric dimensions: quality (is the output good?), cost (what did this call cost and who should pay for it?), and safety (does the output violate any policies?). AI observability layers these dimensions on top of standard infrastructure metrics.
Architecture Overview
The stack is built on OpenTelemetry for trace collection, with custom span attributes for LLM-specific metadata such as model name, token counts, and prompt versions. Traces flow into a collector that fans out to a time-series database for metrics, a search index for trace exploration, and an evaluation pipeline that periodically scores sampled outputs for quality and safety.
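To make the architecture concrete, the whole stack can run locally with Docker Compose (as listed in the prerequisites). The service names, images, and file paths below are illustrative assumptions, not a canonical layout; swap in whatever trace backend you prefer.

```yaml
# Local AI observability stack (sketch; images, ports, and config paths are assumptions)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC from the application
      - "4318:4318"   # OTLP HTTP
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```

The application sends OTLP to the collector, which forwards traces to Tempo and exposes derived metrics for Prometheus to scrape; Grafana reads from both.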
Instrumenting LLM Calls with OpenTelemetry
The instrumentation layer wraps every LLM call in an OpenTelemetry span with custom semantic attributes. These attributes follow the emerging OpenTelemetry GenAI semantic conventions: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and custom attributes for cost, prompt version, and feature identifier. This data flows through the standard OTel pipeline and can be consumed by any OTel-compatible backend.
```python
"""OpenTelemetry instrumentation for LLM calls."""
from __future__ import annotations

import time
from contextlib import contextmanager
from typing import Generator

from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("ai.llm", "1.0.0")

# Pricing per token (as of early 2026)
TOKEN_COSTS = {
    "claude-sonnet-4-20250514": {"input": 3.0e-6, "output": 15.0e-6},
    "claude-haiku-4-5-20251001": {"input": 0.8e-6, "output": 4.0e-6},
    "gpt-4o": {"input": 2.5e-6, "output": 10.0e-6},
}


def _provider(model: str) -> str:
    if "claude" in model:
        return "anthropic"
    if "gpt" in model:
        return "openai"
    return "unknown"


@contextmanager
def trace_llm_call(
    model: str,
    feature: str,
    prompt_version: str | None = None,
    team: str | None = None,
) -> Generator[dict, None, None]:
    """Context manager that wraps an LLM call in an OTel span.

    Usage:
        with trace_llm_call("claude-sonnet-4-20250514", "chat") as ctx:
            response = llm.invoke(messages)
            ctx["input_tokens"] = response.usage.input_tokens
            ctx["output_tokens"] = response.usage.output_tokens
    """
    ctx: dict = {}
    with tracer.start_as_current_span(
        "llm.call",
        attributes={
            "gen_ai.system": _provider(model),
            "gen_ai.request.model": model,
            "ai.feature": feature,
            "ai.prompt_version": prompt_version or "unknown",
            "ai.team": team or "default",
        },
    ) as span:
        start = time.perf_counter()
        try:
            yield ctx
            # Post-call: record usage and derived cost
            input_t = ctx.get("input_tokens", 0)
            output_t = ctx.get("output_tokens", 0)
            costs = TOKEN_COSTS.get(model, {"input": 0, "output": 0})
            cost = input_t * costs["input"] + output_t * costs["output"]
            span.set_attribute("gen_ai.usage.input_tokens", input_t)
            span.set_attribute("gen_ai.usage.output_tokens", output_t)
            span.set_attribute("ai.cost_usd", round(cost, 6))
            span.set_attribute("ai.latency_ms", int((time.perf_counter() - start) * 1000))
            span.set_status(StatusCode.OK)
        except Exception as exc:
            span.set_status(StatusCode.ERROR, str(exc))
            span.record_exception(exc)
            raise
```

Tracing Tool Comparison
| Feature | Langfuse | LangSmith | Custom OTel |
|---|---|---|---|
| Trace visualization | Excellent | Excellent | Jaeger/Grafana Tempo |
| Cost tracking | Built-in | Built-in | Custom attributes + Grafana |
| LLM-as-judge eval | Built-in | Built-in | Custom pipeline |
| Self-hosted option | Yes (OSS) | No | Yes |
| Data residency | EU/US or self-host | US only | Full control |
| Pricing | Free tier + usage | Free tier + usage | Infrastructure only |
Cost Attribution Dashboards
Cost attribution is the killer feature of AI observability. By tagging every LLM span with team, feature, and prompt_version attributes, you can build Grafana dashboards that answer: which team spent the most this week? Which feature has the highest per-request cost? Did the latest prompt change increase or decrease token usage? These dashboards transform LLM cost from an opaque line item into an actionable engineering metric.
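The roll-ups behind those panels are just group-by sums over span attributes. In production, Grafana runs the equivalent query against Prometheus; the plain-Python sketch below (operating on spans exported as dicts, an assumed format) shows the shape of the aggregation:

```python
from collections import defaultdict


def cost_by(spans: list[dict], key: str) -> dict[str, float]:
    """Sum ai.cost_usd over spans, grouped by a span attribute (e.g. ai.team)."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span.get(key, "unknown")] += span.get("ai.cost_usd", 0.0)
    return {k: round(v, 6) for k, v in totals.items()}


spans = [
    {"ai.team": "search", "ai.feature": "chat", "ai.cost_usd": 0.004},
    {"ai.team": "search", "ai.feature": "summarize", "ai.cost_usd": 0.010},
    {"ai.team": "support", "ai.feature": "chat", "ai.cost_usd": 0.002},
]
print(cost_by(spans, "ai.team"))  # {'search': 0.014, 'support': 0.002}
```

Grouping by `ai.prompt_version` instead of `ai.team` is how you answer the "did the latest prompt change move token usage?" question.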
```yaml
# OpenTelemetry Collector config for AI observability
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
# Span-to-metrics conversion is a connector in current collector releases,
# wired in as a traces exporter and a metrics receiver below.
connectors:
  spanmetrics:
    dimensions:
      - name: gen_ai.request.model
      - name: ai.feature
      - name: ai.team
    histogram:
      explicit:
        buckets: [100ms, 250ms, 500ms, 1s, 2500ms, 5s, 10s]
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, spanmetrics]
    metrics:
      receivers: [otlp, spanmetrics]
      processors: [batch]
      exporters: [prometheus]
```

Automated Quality Scoring
Sample a percentage of production responses and score them automatically using an LLM-as-judge evaluator. The evaluator receives the original prompt, retrieved context (if applicable), and the generated response, then scores on dimensions like factual accuracy, instruction following, and tone. Store scores alongside the trace so you can correlate quality regressions with specific prompt changes, model updates, or retrieval pipeline modifications.
```python
"""Automated quality scoring using LLM-as-judge."""
from __future__ import annotations

import json

from anthropic import AsyncAnthropic
from pydantic import BaseModel, Field

client = AsyncAnthropic()


class QualityScore(BaseModel):
    accuracy: int = Field(ge=1, le=5, description="Factual accuracy")
    relevance: int = Field(ge=1, le=5, description="Relevance to query")
    safety: int = Field(ge=1, le=5, description="Safety compliance")
    reasoning: str = Field(description="Brief justification")


JUDGE_PROMPT = """Rate this AI response on a 1-5 scale for each dimension.

User query: {query}
AI response: {response}
{context_section}

Score each dimension:
- accuracy (1-5): Is the response factually correct?
- relevance (1-5): Does it address the user's question?
- safety (1-5): Does it comply with content policies?

Return JSON with accuracy, relevance, safety, and reasoning fields."""


async def score_response(
    query: str,
    response: str,
    context: str | None = None,
) -> QualityScore:
    """Score a response using Claude as a judge."""
    context_section = f"Retrieved context: {context}" if context else ""
    result = await client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                query=query,
                response=response,
                context_section=context_section,
            ),
        }],
    )
    # Parse the judge's JSON output into the structured score model
    data = json.loads(result.content[0].text)
    return QualityScore(**data)
```

Use a cheaper, faster model for judge evaluations (e.g., Claude Haiku) to keep scoring costs manageable. At a 10% sampling rate with Haiku, quality scoring adds roughly $0.001 per evaluated response. Score 100% of responses only during A/B tests or after prompt changes.
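One way to implement the sampling decision is a deterministic hash of the trace ID, so a given trace is always either scored or skipped no matter which worker handles it. A minimal sketch (the function name and 10% default are illustrative):

```python
import hashlib


def should_score(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministically map a trace ID to [0, 1) and compare against the sample rate."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2.0**64
    return bucket < rate
```

Because the decision is a pure function of the trace ID, re-running the evaluation pipeline scores the same traces, and raising `rate` to 1.0 during an A/B test requires no other changes.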
Alerting on Regressions
Configure alerts on three signal types: latency regressions (p95 LLM call latency exceeds baseline by more than 50%), quality regressions (average quality score drops below threshold over a rolling window), and cost spikes (hourly spend exceeds 2x the daily average). Each alert should include the affected feature, team, and model so the on-call engineer can triage quickly. Integrate alerts with your existing incident management system.
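Those three signals map directly onto Prometheus alerting rules. The rules below are a sketch: they assume the spanmetrics-derived latency histogram is exposed as `duration_milliseconds_bucket`, and that quality scores and cost are exported as `ai_quality_score` and `ai_cost_usd_total`; adjust metric and label names to whatever your pipeline actually emits.

```yaml
groups:
  - name: ai-regressions
    rules:
      - alert: LLMLatencyRegression
        # p95 over the last 15m vs. the same window yesterday
        expr: >
          histogram_quantile(0.95, sum by (le, ai_feature)
            (rate(duration_milliseconds_bucket{span_name="llm.call"}[15m])))
          > 1.5 * histogram_quantile(0.95, sum by (le, ai_feature)
            (rate(duration_milliseconds_bucket{span_name="llm.call"}[15m] offset 1d)))
        for: 15m
        annotations:
          summary: "p95 LLM latency for {{ $labels.ai_feature }} is >50% above baseline"
      - alert: QualityScoreDrop
        expr: avg_over_time(ai_quality_score[1h]) < 3.5
        for: 30m
        annotations:
          summary: "Rolling quality score below threshold for {{ $labels.ai_feature }}"
      - alert: CostSpike
        # hourly spend vs. 2x the trailing daily hourly average
        expr: >
          sum by (ai_team) (increase(ai_cost_usd_total[1h]))
          > 2 * (sum by (ai_team) (increase(ai_cost_usd_total[1d])) / 24)
        for: 15m
        annotations:
          summary: "Hourly spend for {{ $labels.ai_team }} exceeds 2x the daily average"
```

The grouping labels (`ai_feature`, `ai_team`) carry through from the span dimensions, which is what lets the on-call engineer triage straight from the alert.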
Version History
1.0.0 · 2026-03-01
- Initial publication with OpenTelemetry instrumentation for LLM calls
- OTel Collector config with span-to-metrics conversion
- LLM-as-judge automated quality scoring pipeline
- Tracing tool comparison: Langfuse vs LangSmith vs custom OTel
- Alerting patterns for latency, quality, and cost regressions