Key Takeaway
The most critical skill for AI on-call is quickly distinguishing between infrastructure failures and model quality degradation, because they require different response teams and different remediation approaches. This playbook provides structured runbooks, diagnostic decision trees, and escalation paths for the five most common categories of AI system alerts.
Prerequisites
- An existing on-call rotation and incident management process
- Model monitoring infrastructure with alerting configured (see: Model Monitoring Playbook)
- Access to model serving logs, metrics dashboards, and deployment tooling
- Model rollback capability (the ability to revert to a previous model version within minutes)
- Contact information for the ML engineering team and data engineering team for escalations
AI On-Call Is Different
Traditional on-call engineers diagnose infrastructure failures: services are down, databases are slow, network is partitioned. AI on-call adds a category that does not exist in traditional systems: the service is up and responding, but the responses are wrong. An inference endpoint returning HTTP 200 with a JSON response that contains a subtly incorrect prediction looks healthy to every standard monitoring tool. Diagnosing these silent failures requires understanding model behavior, data distributions, and quality metrics that most on-call engineers have never worked with.
This playbook bridges the gap by providing structured runbooks that guide on-call engineers through AI-specific diagnosis without requiring deep ML expertise. The runbooks use a triage-first approach: determine whether the issue is infrastructure (on-call can resolve), data pipeline (escalate to data engineering), or model quality (escalate to ML team), and then follow the appropriate resolution path.
Triage Decision Tree
When an AI-related alert fires, the on-call engineer should follow this triage sequence to classify the issue and route to the right response path. The goal is to make a classification decision within ten minutes and escalate to the right team without wasting time investigating issues outside your expertise.
Check 1: Is the service responding?
If health checks are failing, error rates are elevated, or latency exceeds SLA, this is an infrastructure issue. Follow standard infrastructure runbooks: check pod health, GPU allocation, memory usage, and network connectivity. On-call can usually resolve this.
Check 2: Is the data pipeline healthy?
If features are stale, null rates have spiked, or schema validation is failing, this is a data pipeline issue. Check feature store freshness timestamps, upstream pipeline job status, and data source availability. Escalate to data engineering if the pipeline is broken.
Check 3: Is the model producing quality outputs?
If accuracy metrics have dropped, drift alerts are firing, or confidence distributions have shifted, this is a model quality issue. Check whether a recent model deployment or data change correlates with the degradation. Escalate to the ML team.
Check 4: Is the cost anomalous?
If token usage has spiked, GPU utilization is abnormally high, or API costs have jumped, investigate the traffic source. Check for traffic spikes, prompt injection attacks, or misconfigured batch jobs. This may require on-call resolution or a security escalation.
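The four checks above can be sketched as a small first-match-wins classifier. The signal names, the 1% error-rate threshold (which mirrors the runbook trigger below), and the 60-minute staleness cutoff are illustrative assumptions, not values this playbook prescribes.

```python
from dataclasses import dataclass

@dataclass
class AlertSignals:
    """Illustrative snapshot of the signals an on-call engineer checks during triage."""
    health_check_passing: bool
    error_rate: float            # fraction of failed requests
    feature_staleness_min: float # minutes since the feature store was last refreshed
    null_rate_spike: bool
    accuracy_drop: bool
    drift_alert: bool
    cost_anomaly: bool

def triage(s: AlertSignals) -> str:
    """Apply the four checks in order; the first match decides the routing."""
    if not s.health_check_passing or s.error_rate > 0.01:
        return "infrastructure: follow standard runbooks (on-call resolves)"
    if s.feature_staleness_min > 60 or s.null_rate_spike:  # 60 min is an assumed cutoff
        return "data pipeline: escalate to data engineering"
    if s.accuracy_drop or s.drift_alert:
        return "model quality: escalate to ML team"
    if s.cost_anomaly:
        return "cost: investigate traffic source (on-call or security)"
    return "unclassified: continue monitoring"
```

The ordering matters: an unhealthy service can also produce drift-looking metrics, so infrastructure is ruled out first before any quality signal is trusted.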
Runbook: Inference Failures
Inference failures include timeout spikes, error rate increases, GPU out-of-memory errors, and model loading failures. These are the most familiar alert category for traditional on-call engineers because they resemble standard infrastructure issues.
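Several diagnosis steps in the runbook below poll HTTP health endpoints by hand with curl. A small helper can standardize that check; the `model_loaded` field mirrors the runbook's expected response, while the function names and the five-second timeout are illustrative assumptions.

```python
import json
import urllib.request

def fetch_health(url: str) -> dict:
    """Fetch a service health endpoint and parse its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())

def model_ready(health: dict) -> bool:
    """True when the serving pod reports its model artifact is loaded."""
    return health.get("model_loaded") is True
```

Separating the fetch from the readiness check keeps the readiness logic testable without a live endpoint.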
# Runbook: Inference Timeout / Error Spike
# Trigger: p99 latency > 2x baseline OR error rate > 1%
name: inference-failure-runbook
severity: SEV-2 (auto-escalate to SEV-1 if > 5 min)
diagnosis:
  - step: Check GPU utilization
    command: kubectl top pods -n ml-serving --containers
    expected: GPU utilization < 90%
    if_exceeded: Scale up replicas or investigate stuck requests
  - step: Check for OOM errors
    command: kubectl logs -n ml-serving -l app=model-inference --tail=100 | grep -i "out of memory"
    expected: No OOM events
    if_found: Restart affected pods; consider GPU upgrade if recurring
  - step: Check model loading status
    command: curl http://model-service:8080/health
    expected: "model_loaded: true"
    if_not_loaded: Check model artifact availability; verify model registry connectivity
  - step: Check upstream dependencies
    command: curl http://feature-store:8080/health
    expected: "status: healthy"
    if_unhealthy: Feature store may be causing inference delays
mitigation:
  immediate:
    - Scale inference replicas if GPU utilization > 90%
    - Restart pods showing OOM errors
    - Enable request queuing if not already active
  if_unresolved_after_15_min:
    - Roll back to the previous model version
    - Enable fallback model (lighter model with degraded quality)
    - Page ML engineering team lead
escalation:
  after: 15 minutes without resolution
  to: ML engineering team lead
  include: Grafana dashboard URL, error logs, timeline of actions taken
Runbook: Model Quality Degradation
Quality degradation is the most challenging alert category for on-call because it requires understanding what 'correct' model behavior looks like. The runbook for quality alerts focuses on gathering diagnostic information and making a rollback decision rather than fixing the root cause, which typically requires ML team investigation.
The key question for on-call is: should I rollback the model? The answer is yes if the quality degradation is severe (accuracy below SLA), if the degradation correlates with a recent deployment, or if the outputs are harmful (discriminatory, toxic, or leaking PII). The answer is 'gather more data' if the degradation is moderate and does not correlate with a recent change, as it may be caused by data drift rather than a model issue.
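The rollback criteria above reduce to a short decision function. This is a sketch of that logic only; the parameter names are illustrative, and in practice each input would come from your monitoring dashboards and deployment history.

```python
def should_rollback(accuracy: float, sla_accuracy: float,
                    recent_deploy: bool, harmful_outputs: bool) -> str:
    """On-call rollback decision for a model quality alert, per the criteria above."""
    if harmful_outputs:
        return "rollback"  # discriminatory, toxic, or PII-leaking outputs
    if accuracy < sla_accuracy:
        return "rollback"  # severe degradation: accuracy below SLA
    if recent_deploy:
        return "rollback"  # degradation correlates with a recent deployment
    return "gather more data"  # moderate degradation, likely data drift
```

Note that harmful outputs trigger rollback unconditionally, before any accuracy comparison, since the blast radius of unsafe outputs is not captured by an accuracy number.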
Runbook: LLM-Specific Issues
LLM-powered features have unique failure modes that require specialized runbooks. Hallucination spikes, safety filter triggers, provider rate limiting, and prompt injection attacks all require responses that do not map to traditional runbooks. The LLM runbook provides specific diagnostic steps for each of these scenarios.
| LLM Issue | Diagnosis | Immediate Action | Escalation |
|---|---|---|---|
| Hallucination spike | Check if model version changed. Compare recent outputs against ground truth. Check if RAG retrieval quality degraded. | If correlated with model update: rollback. If RAG issue: check vector store health. If neither: enable output validation layer. | ML team for root cause analysis |
| Provider rate limiting | Check API usage dashboard. Identify traffic source (legitimate spike vs. attack). | Enable request queuing. Activate secondary provider if configured. Throttle low-priority features. | On-call can usually resolve. Escalate if traffic source is unknown. |
| Prompt injection detected | Review flagged inputs in security logs. Check if outputs were affected. Assess data exposure risk. | Block the source if identifiable. Review outputs for PII leakage. Enable enhanced input filtering. | Security team + ML team. If PII was leaked: SEV-1. |
| Cost spike | Check token usage per endpoint. Identify which prompts are consuming most tokens. Check for recursive or loop behaviors. | Set per-request token limits. Disable affected feature via feature flag if cost is runaway. Check for batch job misconfigurations. | Engineering manager for budget decisions |
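One immediate action from the cost-spike row, setting per-request token limits, can be enforced as a thin guard in front of the LLM call. The budget constant and function name here are illustrative assumptions, not values from this playbook.

```python
MAX_TOKENS_PER_REQUEST = 2048  # illustrative budget; tune per endpoint

def enforce_token_limit(prompt_tokens: int, requested_completion: int) -> int:
    """Clamp the completion budget so a single request cannot run away on cost."""
    remaining = MAX_TOKENS_PER_REQUEST - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt alone exceeds the per-request token budget")
    return min(requested_completion, remaining)
```

A guard like this caps the worst case per request; the feature-flag kill switch in the table remains the backstop for aggregate runaway cost.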
Create a 'model rollback' command alias that every on-call engineer can run without looking up documentation. The rollback command should revert to the last known good model version, log the rollback event, and notify the ML team. Making rollback easy ensures that on-call engineers actually do it when needed rather than waiting for the ML team to become available.
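The rollback command described above can be sketched with its three required effects (revert, log, notify) injected as callables. The registry and notification interfaces are assumptions; a real implementation would wire these to your model registry, deployment tooling, and paging system.

```python
from typing import Callable

def rollback_model(get_last_good: Callable[[], str],
                   deploy: Callable[[str], None],
                   notify: Callable[[str], None],
                   log: Callable[[str], None]) -> str:
    """Revert to the last known good model version, log the event, notify the ML team."""
    version = get_last_good()
    deploy(version)
    log(f"model rollback to {version}")
    notify(f"on-call rolled back the model to {version}")
    return version

if __name__ == "__main__":
    # Stub wiring for illustration; real callables would hit the registry and pager.
    events: list = []
    rollback_model(
        get_last_good=lambda: "v41",
        deploy=lambda v: events.append(("deploy", v)),
        notify=lambda m: events.append(("notify", m)),
        log=lambda m: events.append(("log", m)),
    )
```

Injecting the effects keeps the command testable in CI, which matters for a script that on-call will run under pressure and must trust.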
Version History
1.0.0 · 2026-03-01
- Initial release with triage decision tree and five runbook categories
- YAML-based inference failure runbook with step-by-step diagnosis
- LLM-specific issue comparison table with diagnosis and response actions
- Model rollback guidance for on-call engineers
- Readiness checklist for AI on-call program