Key Takeaway
By the end of this blueprint you will have an AI observability stack that captures distributed traces across LLM calls and tool invocations using OpenTelemetry, feeds cost attribution dashboards in Grafana, runs automated quality scoring with LLM-as-judge evaluators, and alerts on regressions before users notice.
Prerequisites
- An LLM application in production (or staging) generating real traffic
- Docker Compose for running the collector, Prometheus, and Grafana locally
- Python 3.11+ with the OpenTelemetry SDK installed
- Familiarity with distributed tracing concepts (traces, spans, attributes)
- Optional: a Langfuse or LangSmith account for managed LLM tracing
Why Traditional APM Falls Short
Traditional APM tools track request latency, error rates, and throughput. These are necessary but insufficient for AI applications. An LLM call can return HTTP 200 with a perfectly structured response that is factually wrong, off-brand, or unsafe. You need three additional metric dimensions: quality (is the output good?), cost (what did this call cost and who should pay for it?), and safety (does the output violate any policies?). AI observability layers these dimensions on top of standard infrastructure metrics.
Architecture Overview
The stack is built on OpenTelemetry for trace collection, with custom span attributes for LLM-specific metadata such as model name, token counts, and prompt versions. Traces flow into a collector that fans out to a time-series database for metrics, a search index for trace exploration, and an evaluation pipeline that periodically scores sampled outputs for quality and safety.
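To make the architecture concrete, the whole stack can run locally with Docker Compose (as listed in the prerequisites). The service names, images, and file paths below are illustrative assumptions, not a canonical layout; swap in whatever trace backend you prefer.

```yaml
# Local AI observability stack (sketch; images, ports, and config paths are assumptions)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC from the application
      - "4318:4318"   # OTLP HTTP
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```

The application sends OTLP to the collector, which forwards traces to Tempo and exposes derived metrics for Prometheus to scrape; Grafana reads from both.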
Instrumenting LLM Calls with OpenTelemetry
The instrumentation layer wraps every LLM call in an OpenTelemetry span with custom semantic attributes. These attributes follow the emerging OpenTelemetry GenAI semantic conventions: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and custom attributes for cost, prompt version, and feature identifier. This data flows through the standard OTel pipeline and can be consumed by any OTel-compatible backend.
```python
"""OpenTelemetry instrumentation for LLM calls."""
from __future__ import annotations

import time
from contextlib import contextmanager
from typing import Generator

from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("ai.llm", "1.0.0")

# Pricing per token (as of early 2026)
TOKEN_COSTS = {
    "claude-sonnet-4-20250514": {"input": 3.0e-6, "output": 15.0e-6},
    "claude-haiku-4-5-20251001": {"input": 0.8e-6, "output": 4.0e-6},
    "gpt-4o": {"input": 2.5e-6, "output": 10.0e-6},
}


def _provider(model: str) -> str:
    if "claude" in model:
        return "anthropic"
    if "gpt" in model:
        return "openai"
    return "unknown"


@contextmanager
def trace_llm_call(
    model: str,
    feature: str,
    prompt_version: str | None = None,
    team: str | None = None,
) -> Generator[dict, None, None]:
    """Context manager that wraps an LLM call in an OTel span.

    Usage:
        with trace_llm_call("claude-sonnet-4-20250514", "chat") as ctx:
            response = llm.invoke(messages)
            ctx["input_tokens"] = response.usage.input_tokens
            ctx["output_tokens"] = response.usage.output_tokens
    """
    ctx: dict = {}
    with tracer.start_as_current_span(
        "llm.call",
        attributes={
            "gen_ai.system": _provider(model),
            "gen_ai.request.model": model,
            "ai.feature": feature,
            "ai.prompt_version": prompt_version or "unknown",
            "ai.team": team or "default",
        },
    ) as span:
        start = time.perf_counter()
        try:
            yield ctx
            # Post-call: record usage and derived cost
            input_t = ctx.get("input_tokens", 0)
            output_t = ctx.get("output_tokens", 0)
            costs = TOKEN_COSTS.get(model, {"input": 0, "output": 0})
            cost = input_t * costs["input"] + output_t * costs["output"]
            span.set_attribute("gen_ai.usage.input_tokens", input_t)
            span.set_attribute("gen_ai.usage.output_tokens", output_t)
            span.set_attribute("ai.cost_usd", round(cost, 6))
            span.set_attribute("ai.latency_ms", int((time.perf_counter() - start) * 1000))
            span.set_status(StatusCode.OK)
        except Exception as exc:
            span.set_status(StatusCode.ERROR, str(exc))
            span.record_exception(exc)
            raise
```

Tracing Tool Comparison
| Feature | Langfuse | LangSmith | Custom OTel |
|---|---|---|---|
| Trace visualization | Excellent | Excellent | Jaeger/Grafana Tempo |
| Cost tracking | Built-in | Built-in | Custom attributes + Grafana |
| LLM-as-judge eval | Built-in | Built-in | Custom pipeline |
| Self-hosted option | Yes (OSS) | No | Yes |
| Data residency | EU/US or self-host | US only | Full control |
| Pricing | Free tier + usage | Free tier + usage | Infrastructure only |
Cost Attribution Dashboards
Cost attribution is the killer feature of AI observability. By tagging every LLM span with team, feature, and prompt_version attributes, you can build Grafana dashboards that answer: which team spent the most this week? Which feature has the highest per-request cost? Did the latest prompt change increase or decrease token usage? These dashboards transform LLM cost from an opaque line item into an actionable engineering metric.
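The roll-ups behind those panels are just group-by sums over span attributes. In production, Grafana runs the equivalent query against Prometheus; the plain-Python sketch below (operating on spans exported as dicts, an assumed format) shows the shape of the aggregation:

```python
from collections import defaultdict


def cost_by(spans: list[dict], key: str) -> dict[str, float]:
    """Sum ai.cost_usd over spans, grouped by a span attribute (e.g. ai.team)."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span.get(key, "unknown")] += span.get("ai.cost_usd", 0.0)
    return {k: round(v, 6) for k, v in totals.items()}


spans = [
    {"ai.team": "search", "ai.feature": "chat", "ai.cost_usd": 0.004},
    {"ai.team": "search", "ai.feature": "summarize", "ai.cost_usd": 0.010},
    {"ai.team": "support", "ai.feature": "chat", "ai.cost_usd": 0.002},
]
print(cost_by(spans, "ai.team"))  # {'search': 0.014, 'support': 0.002}
```

Grouping by `ai.prompt_version` instead of `ai.team` is how you answer the "did the latest prompt change move token usage?" question.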
```yaml
# OpenTelemetry Collector config for AI observability
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
# Span-to-metrics conversion is a connector in current collector releases,
# wired in as a traces exporter and a metrics receiver below.
connectors:
  spanmetrics:
    dimensions:
      - name: gen_ai.request.model
      - name: ai.feature
      - name: ai.team
    histogram:
      explicit:
        buckets: [100ms, 250ms, 500ms, 1s, 2500ms, 5s, 10s]
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, spanmetrics]
    metrics:
      receivers: [otlp, spanmetrics]
      processors: [batch]
      exporters: [prometheus]
```

Automated Quality Scoring
Sample a percentage of production responses and score them automatically using an LLM-as-judge evaluator. The evaluator receives the original prompt, retrieved context (if applicable), and the generated response, then scores on dimensions like factual accuracy, instruction following, and tone. Store scores alongside the trace so you can correlate quality regressions with specific prompt changes, model updates, or retrieval pipeline modifications.
```python
"""Automated quality scoring using LLM-as-judge."""
from __future__ import annotations

import json

from anthropic import AsyncAnthropic
from pydantic import BaseModel, Field

client = AsyncAnthropic()


class QualityScore(BaseModel):
    accuracy: int = Field(ge=1, le=5, description="Factual accuracy")
    relevance: int = Field(ge=1, le=5, description="Relevance to query")
    safety: int = Field(ge=1, le=5, description="Safety compliance")
    reasoning: str = Field(description="Brief justification")


JUDGE_PROMPT = """Rate this AI response on a 1-5 scale for each dimension.

User query: {query}
AI response: {response}
{context_section}

Score each dimension:
- accuracy (1-5): Is the response factually correct?
- relevance (1-5): Does it address the user's question?
- safety (1-5): Does it comply with content policies?

Return JSON with accuracy, relevance, safety, and reasoning fields."""


async def score_response(
    query: str,
    response: str,
    context: str | None = None,
) -> QualityScore:
    """Score a response using Claude as a judge."""
    context_section = f"Retrieved context: {context}" if context else ""
    result = await client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                query=query,
                response=response,
                context_section=context_section,
            ),
        }],
    )
    # Parse the judge's JSON output into the structured score model
    data = json.loads(result.content[0].text)
    return QualityScore(**data)
```

Use a cheaper, faster model for judge evaluations (e.g., Claude Haiku) to keep scoring costs manageable. At a 10% sampling rate with Haiku, quality scoring adds roughly $0.001 per evaluated response. Score 100% of responses only during A/B tests or after prompt changes.
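One way to implement the sampling decision is a deterministic hash of the trace ID, so a given trace is always either scored or skipped no matter which worker handles it. A minimal sketch (the function name and 10% default are illustrative):

```python
import hashlib


def should_score(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministically map a trace ID to [0, 1) and compare against the sample rate."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2.0**64
    return bucket < rate
```

Because the decision is a pure function of the trace ID, re-running the evaluation pipeline scores the same traces, and raising `rate` to 1.0 during an A/B test requires no other changes.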
Alerting on Regressions
Configure alerts on three signal types: latency regressions (p95 LLM call latency exceeds baseline by more than 50%), quality regressions (average quality score drops below threshold over a rolling window), and cost spikes (hourly spend exceeds 2x the daily average). Each alert should include the affected feature, team, and model so the on-call engineer can triage quickly. Integrate alerts with your existing incident management system.
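Those three signals map directly onto Prometheus alerting rules. The rules below are a sketch: they assume the spanmetrics-derived latency histogram is exposed as `duration_milliseconds_bucket`, and that quality scores and cost are exported as `ai_quality_score` and `ai_cost_usd_total`; adjust metric and label names to whatever your pipeline actually emits.

```yaml
groups:
  - name: ai-regressions
    rules:
      - alert: LLMLatencyRegression
        # p95 over the last 15m vs. the same window yesterday
        expr: >
          histogram_quantile(0.95, sum by (le, ai_feature)
            (rate(duration_milliseconds_bucket{span_name="llm.call"}[15m])))
          > 1.5 * histogram_quantile(0.95, sum by (le, ai_feature)
            (rate(duration_milliseconds_bucket{span_name="llm.call"}[15m] offset 1d)))
        for: 15m
        annotations:
          summary: "p95 LLM latency for {{ $labels.ai_feature }} is >50% above baseline"
      - alert: QualityScoreDrop
        expr: avg_over_time(ai_quality_score[1h]) < 3.5
        for: 30m
        annotations:
          summary: "Rolling quality score below threshold for {{ $labels.ai_feature }}"
      - alert: CostSpike
        # hourly spend vs. 2x the trailing daily hourly average
        expr: >
          sum by (ai_team) (increase(ai_cost_usd_total[1h]))
          > 2 * (sum by (ai_team) (increase(ai_cost_usd_total[1d])) / 24)
        for: 15m
        annotations:
          summary: "Hourly spend for {{ $labels.ai_team }} exceeds 2x the daily average"
```

The grouping labels (`ai_feature`, `ai_team`) carry through from the span dimensions, which is what lets the on-call engineer triage straight from the alert.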
Version History
1.0.0 · 2026-03-01
- Initial publication with OpenTelemetry instrumentation for LLM calls
- OTel Collector config with span-to-metrics conversion
- LLM-as-judge automated quality scoring pipeline
- Tracing tool comparison: Langfuse vs LangSmith vs custom OTel
- Alerting patterns for latency, quality, and cost regressions