Key Takeaway
The most critical skill for AI on-call is quickly distinguishing between infrastructure failures and model quality degradation, because they require different response teams and different remediation approaches. This playbook provides structured runbooks, diagnostic decision trees, and escalation paths for the five most common categories of AI system alerts.
Prerequisites
- An existing on-call rotation and incident management process
- Model monitoring infrastructure with alerting configured (see: Model Monitoring Playbook)
- Access to model serving logs, metrics dashboards, and deployment tooling
- Model rollback capability (the ability to revert to a previous model version within minutes)
- Contact information for the ML engineering team and data engineering team for escalations
AI On-Call Is Different
Traditional on-call engineers diagnose infrastructure failures: services are down, databases are slow, network is partitioned. AI on-call adds a category that does not exist in traditional systems: the service is up and responding, but the responses are wrong. An inference endpoint returning HTTP 200 with a JSON response that contains a subtly incorrect prediction looks healthy to every standard monitoring tool. Diagnosing these silent failures requires understanding model behavior, data distributions, and quality metrics that most on-call engineers have never worked with.
This playbook bridges the gap by providing structured runbooks that guide on-call engineers through AI-specific diagnosis without requiring deep ML expertise. The runbooks use a triage-first approach: determine whether the issue is infrastructure (on-call can resolve), data pipeline (escalate to data engineering), or model quality (escalate to ML team), and then follow the appropriate resolution path.
Triage Decision Tree
When an AI-related alert fires, the on-call engineer should follow this triage sequence to classify the issue and route to the right response path. The goal is to make a classification decision within ten minutes and escalate to the right team without wasting time investigating issues outside your expertise.
Check 1: Is the service responding?
If health checks are failing, error rates are elevated, or latency exceeds SLA, this is an infrastructure issue. Follow standard infrastructure runbooks: check pod health, GPU allocation, memory usage, and network connectivity. On-call can usually resolve this.
Check 2: Is the data pipeline healthy?
If features are stale, null rates have spiked, or schema validation is failing, this is a data pipeline issue. Check feature store freshness timestamps, upstream pipeline job status, and data source availability. Escalate to data engineering if the pipeline is broken.
Check 3: Is the model producing quality outputs?
If accuracy metrics have dropped, drift alerts are firing, or confidence distributions have shifted, this is a model quality issue. Check whether a recent model deployment or data change correlates with the degradation. Escalate to the ML team.
Check 4: Is the cost anomalous?
If token usage has spiked, GPU utilization is abnormally high, or API costs have jumped, investigate the traffic source. Check for traffic spikes, prompt injection attacks, or misconfigured batch jobs. This may require on-call resolution or a security escalation.
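The four checks above can be sketched as a small first-match-wins classifier. The signal names, the 1% error-rate threshold (which mirrors the runbook trigger below), and the 60-minute staleness cutoff are illustrative assumptions, not values this playbook prescribes.

```python
from dataclasses import dataclass

@dataclass
class AlertSignals:
    """Illustrative snapshot of the signals an on-call engineer checks during triage."""
    health_check_passing: bool
    error_rate: float            # fraction of failed requests
    feature_staleness_min: float # minutes since the feature store was last refreshed
    null_rate_spike: bool
    accuracy_drop: bool
    drift_alert: bool
    cost_anomaly: bool

def triage(s: AlertSignals) -> str:
    """Apply the four checks in order; the first match decides the routing."""
    if not s.health_check_passing or s.error_rate > 0.01:
        return "infrastructure: follow standard runbooks (on-call resolves)"
    if s.feature_staleness_min > 60 or s.null_rate_spike:  # 60 min is an assumed cutoff
        return "data pipeline: escalate to data engineering"
    if s.accuracy_drop or s.drift_alert:
        return "model quality: escalate to ML team"
    if s.cost_anomaly:
        return "cost: investigate traffic source (on-call or security)"
    return "unclassified: continue monitoring"
```

The ordering matters: an unhealthy service can also produce drift-looking metrics, so infrastructure is ruled out first before any quality signal is trusted.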
Runbook: Inference Failures
Inference failures include timeout spikes, error rate increases, GPU out-of-memory errors, and model loading failures. These are the most familiar alert category for traditional on-call engineers because they resemble standard infrastructure issues.
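Several diagnosis steps in the runbook below poll HTTP health endpoints by hand with curl. A small helper can standardize that check; the `model_loaded` field mirrors the runbook's expected response, while the function names and the five-second timeout are illustrative assumptions.

```python
import json
import urllib.request

def fetch_health(url: str) -> dict:
    """Fetch a service health endpoint and parse its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())

def model_ready(health: dict) -> bool:
    """True when the serving pod reports its model artifact is loaded."""
    return health.get("model_loaded") is True
```

Separating the fetch from the readiness check keeps the readiness logic testable without a live endpoint.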
# Runbook: Inference Timeout / Error Spike
# Trigger: p99 latency > 2x baseline OR error rate > 1%
name: inference-failure-runbook
severity: SEV-2 (auto-escalate to SEV-1 if > 5 min)
diagnosis:
  - step: Check GPU utilization
    command: kubectl top pods -n ml-serving --containers
    expected: GPU utilization < 90%
    if_exceeded: Scale up replicas or investigate stuck requests
  - step: Check for OOM errors
    command: kubectl logs -n ml-serving -l app=model-inference --tail=100 | grep -i "out of memory"
    expected: No OOM events
    if_found: Restart affected pods; consider GPU upgrade if recurring
  - step: Check model loading status
    command: curl http://model-service:8080/health
    expected: "model_loaded: true"
    if_not_loaded: Check model artifact availability; verify model registry connectivity
  - step: Check upstream dependencies
    command: curl http://feature-store:8080/health
    expected: "status: healthy"
    if_unhealthy: Feature store may be causing inference delays
mitigation:
  immediate:
    - Scale inference replicas if GPU utilization > 90%
    - Restart pods showing OOM errors
    - Enable request queuing if not already active
  if_unresolved_after_15_min:
    - Roll back to the previous model version
    - Enable fallback model (lighter model with degraded quality)
    - Page ML engineering team lead
escalation:
  after: 15 minutes without resolution
  to: ML engineering team lead
  include: Grafana dashboard URL, error logs, timeline of actions taken
Runbook: Model Quality Degradation
Quality degradation is the most challenging alert category for on-call because it requires understanding what 'correct' model behavior looks like. The runbook for quality alerts focuses on gathering diagnostic information and making a rollback decision rather than fixing the root cause, which typically requires ML team investigation.
The key question for on-call is: should I rollback the model? The answer is yes if the quality degradation is severe (accuracy below SLA), if the degradation correlates with a recent deployment, or if the outputs are harmful (discriminatory, toxic, or leaking PII). The answer is 'gather more data' if the degradation is moderate and does not correlate with a recent change, as it may be caused by data drift rather than a model issue.
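The rollback criteria above reduce to a short decision function. This is a sketch of that logic only; the parameter names are illustrative, and in practice each input would come from your monitoring dashboards and deployment history.

```python
def should_rollback(accuracy: float, sla_accuracy: float,
                    recent_deploy: bool, harmful_outputs: bool) -> str:
    """On-call rollback decision for a model quality alert, per the criteria above."""
    if harmful_outputs:
        return "rollback"  # discriminatory, toxic, or PII-leaking outputs
    if accuracy < sla_accuracy:
        return "rollback"  # severe degradation: accuracy below SLA
    if recent_deploy:
        return "rollback"  # degradation correlates with a recent deployment
    return "gather more data"  # moderate degradation, likely data drift
```

Note that harmful outputs trigger rollback unconditionally, before any accuracy comparison, since the blast radius of unsafe outputs is not captured by an accuracy number.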
Runbook: LLM-Specific Issues
LLM-powered features have unique failure modes that require specialized runbooks. Hallucination spikes, safety filter triggers, provider rate limiting, and prompt injection attacks all require responses that do not map to traditional runbooks. The LLM runbook provides specific diagnostic steps for each of these scenarios.
| LLM Issue | Diagnosis | Immediate Action | Escalation |
|---|---|---|---|
| Hallucination spike | Check if model version changed. Compare recent outputs against ground truth. Check if RAG retrieval quality degraded. | If correlated with model update: rollback. If RAG issue: check vector store health. If neither: enable output validation layer. | ML team for root cause analysis |
| Provider rate limiting | Check API usage dashboard. Identify traffic source (legitimate spike vs. attack). | Enable request queuing. Activate secondary provider if configured. Throttle low-priority features. | On-call can usually resolve. Escalate if traffic source is unknown. |
| Prompt injection detected | Review flagged inputs in security logs. Check if outputs were affected. Assess data exposure risk. | Block the source if identifiable. Review outputs for PII leakage. Enable enhanced input filtering. | Security team + ML team. If PII was leaked: SEV-1. |
| Cost spike | Check token usage per endpoint. Identify which prompts are consuming most tokens. Check for recursive or loop behaviors. | Set per-request token limits. Disable affected feature via feature flag if cost is runaway. Check for batch job misconfigurations. | Engineering manager for budget decisions |
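One immediate action from the cost-spike row, setting per-request token limits, can be enforced as a thin guard in front of the LLM call. The budget constant and function name here are illustrative assumptions, not values from this playbook.

```python
MAX_TOKENS_PER_REQUEST = 2048  # illustrative budget; tune per endpoint

def enforce_token_limit(prompt_tokens: int, requested_completion: int) -> int:
    """Clamp the completion budget so a single request cannot run away on cost."""
    remaining = MAX_TOKENS_PER_REQUEST - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone exceeds the per-request token budget")
    return min(requested_completion, remaining)
```

A guard like this caps the worst case per request; the feature-flag kill switch in the table remains the backstop for aggregate runaway cost.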
Create a 'model rollback' command alias that every on-call engineer can run without looking up documentation. The rollback command should revert to the last known good model version, log the rollback event, and notify the ML team. Making rollback easy ensures that on-call engineers actually do it when needed rather than waiting for the ML team to become available.
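The rollback command described above can be sketched with its three required effects (revert, log, notify) injected as callables. The registry and notification interfaces are assumptions; a real implementation would wire these to your model registry, deployment tooling, and paging system.

```python
from typing import Callable

def rollback_model(get_last_good: Callable[[], str],
                   deploy: Callable[[str], None],
                   notify: Callable[[str], None],
                   log: Callable[[str], None]) -> str:
    """Revert to the last known good model version, log the event, notify the ML team."""
    version = get_last_good()
    deploy(version)
    log(f"model rollback to {version}")
    notify(f"on-call rolled back the model to {version}")
    return version

if __name__ == "__main__":
    # Stub wiring for illustration; real callables would hit the registry and pager.
    events: list = []
    rollback_model(
        get_last_good=lambda: "v41",
        deploy=lambda v: events.append(("deploy", v)),
        notify=lambda m: events.append(("notify", m)),
        log=lambda m: events.append(("log", m)),
    )
```

Injecting the effects keeps the command testable in CI, which matters for a script that on-call will run under pressure and must trust.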
Version History
1.0.0 · 2026-03-01
- Initial release with triage decision tree and five runbook categories
- YAML-based inference failure runbook with step-by-step diagnosis
- LLM-specific issue comparison table with diagnosis and response actions
- Model rollback guidance for on-call engineers
- Readiness checklist for AI on-call program