Key Takeaway
The biggest risk with AI incidents is detection latency. A model producing plausible but incorrect outputs can go undetected for days. This playbook defines AI-specific severity levels, detection strategies, structured escalation paths, root cause analysis frameworks, and communication templates that reduce mean time to detection from days to minutes.
Prerequisites
- An existing incident management framework (PagerDuty, Opsgenie, or equivalent)
- AI model monitoring infrastructure with drift detection and quality alerting
- Defined SLAs for model accuracy, latency, and availability
- On-call rotation that includes engineers with ML system experience
- Model versioning and rollback capabilities in your deployment pipeline
AI Incidents Are Different
Traditional software incidents have clear symptoms: the service returns errors, latency spikes, or the health check fails. AI incidents are fundamentally different because the system can be operationally healthy -- serving responses within latency SLAs with zero errors -- while producing outputs that are subtly wrong. Consider a recommendation model that starts surfacing irrelevant content, a credit scoring model whose accuracy has drifted below acceptable thresholds, or an LLM that begins hallucinating facts that pass superficial plausibility checks. All of these are incidents, yet none of them triggers a traditional monitoring alert.
This asymmetry means AI incident response requires a fundamentally different detection philosophy. Instead of monitoring system health (is it up?), you must monitor system correctness (is it right?). And because correctness is harder to measure than availability, AI incident detection requires purpose-built monitoring layers, sample-based quality evaluation, and feedback loops from downstream consumers of model outputs.
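To make the "is it right?" question concrete, a minimal correctness probe can join recent predictions against delayed ground-truth labels and compute a rolling accuracy. This is a sketch, not part of the playbook's framework; the function name and window size are illustrative:

```python
def rolling_accuracy(predictions, labels, window=500):
    """Accuracy over the most recent window of labeled
    predictions -- a correctness signal that a liveness
    health check cannot provide. Returns None when no
    labeled pairs are available yet."""
    paired = list(zip(predictions, labels))[-window:]
    if not paired:
        return None
    correct = sum(1 for pred, label in paired if pred == label)
    return correct / len(paired)
```

In practice the labels arrive hours or days after the predictions, so the probe runs on a lag; the point is that it measures correctness directly instead of inferring health from uptime.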
Severity Classification
AI incidents require a severity classification system calibrated to AI-specific failure modes. Traditional SEV-1 through SEV-4 definitions based on user impact and revenue loss still apply, but must be extended with dimensions for model quality degradation, data integrity compromise, and compliance violation severity.
| Severity | Definition | Detection | Response Time | Escalation |
|---|---|---|---|---|
| SEV-1: Critical | Model producing harmful, discriminatory, or dangerous outputs. PII leakage detected. Complete model failure in production. | Output safety monitoring, user reports, compliance alerts | Immediate (< 15 min) | VP Engineering + Legal + CISO. War room within 30 min. |
| SEV-2: High | Model accuracy below SLA threshold. Significant hallucination increase. Data pipeline producing corrupted features. | Accuracy monitoring, drift alerts, pipeline health checks | Within 1 hour | Engineering Director + ML Lead. Incident channel within 2 hours. |
| SEV-3: Medium | Moderate accuracy degradation within SLA but trending downward. Latency degradation. Feature staleness. | Trend monitoring, weekly evaluation reports | Within 4 hours (business hours) | ML team lead. Tracked in incident system. |
| SEV-4: Low | Minor quality fluctuations within normal bounds. Non-critical monitoring gap identified. Documentation outdated. | Regular model evaluations, audit reviews | Next business day | Assigned to backlog. Reviewed in weekly standup. |
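The table above can be encoded directly so that alert routing and paging enforce it mechanically. This is an illustrative encoding; the role names and timings come from the table, but your own org chart and paging tool will differ:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    response_minutes: int  # maximum time to first response
    escalation: tuple      # roles notified, in order

# Encoding of the severity table; SEV-4 uses next business
# day, approximated here as 24 hours.
SEVERITY_POLICIES = {
    "SEV-1": SeverityPolicy(15, ("VP Engineering", "Legal", "CISO")),
    "SEV-2": SeverityPolicy(60, ("Engineering Director", "ML Lead")),
    "SEV-3": SeverityPolicy(240, ("ML Team Lead",)),
    "SEV-4": SeverityPolicy(24 * 60, ("Backlog",)),
}


def policy_for(severity: str) -> SeverityPolicy:
    """Look up the response policy for a severity label."""
    return SEVERITY_POLICIES[severity]
```

Keeping the policy in code rather than a wiki page means the paging configuration and the documented SLA cannot silently diverge.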
Detection Strategies
Detection is the hardest part of AI incident response. Unlike infrastructure alerts where the signal is clear, AI quality degradation often presents as a gradual shift rather than a sudden failure. The detection strategy must combine automated statistical monitoring, structured human review, and user feedback channels.
"""AI incident detection framework.
Monitors model outputs for quality degradation, drift,
and safety violations. Designed to run as a background
service alongside your inference pipeline.
"""
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Callable
import time
class IncidentSeverity(Enum):
CRITICAL = 1
HIGH = 2
MEDIUM = 3
LOW = 4
@dataclass
class DetectionResult:
"""Result of an incident detection check."""
detected: bool
severity: Optional[IncidentSeverity]
detector_name: str
message: str
metric_value: float
threshold: float
timestamp: float
class AIIncidentDetector:
"""Composite detector that runs multiple checks
against model outputs and triggers alerts.
"""
def __init__(self):
self.detectors: List[Callable] = []
self.alert_handlers: List[Callable] = []
def register_detector(self, fn: Callable) -> None:
"""Register a detection function."""
self.detectors.append(fn)
def register_alert_handler(self, fn: Callable) -> None:
"""Register a handler for detected incidents."""
self.alert_handlers.append(fn)
def check(self, outputs: list, labels: list = None) -> List[DetectionResult]:
"""Run all registered detectors against recent outputs."""
results = []
for detector in self.detectors:
result = detector(outputs, labels)
if result.detected:
for handler in self.alert_handlers:
handler(result)
results.append(result)
return results
def confidence_drop_detector(
outputs: list,
labels: list = None,
window_size: int = 100,
threshold: float = 0.7,
) -> DetectionResult:
"""Detect when average model confidence drops below threshold."""
if not outputs:
return DetectionResult(
detected=False,
severity=None,
detector_name="confidence_drop",
message="No outputs to evaluate",
metric_value=0.0,
threshold=threshold,
timestamp=time.time(),
)
recent = outputs[-window_size:]
avg_confidence = sum(
o.get("confidence", 0) for o in recent
) / len(recent)
return DetectionResult(
detected=avg_confidence < threshold,
severity=IncidentSeverity.HIGH if avg_confidence < threshold else None,
detector_name="confidence_drop",
message=f"Average confidence {avg_confidence:.3f} vs threshold {threshold}",
metric_value=avg_confidence,
threshold=threshold,
timestamp=time.time(),
)Escalation Matrix
AI incidents require escalation paths that are different from standard software incidents because they often require specialists who are not part of the standard on-call rotation. Infrastructure failures go to the platform team. Model quality issues go to the ML engineering team. Fairness or compliance violations go to the ethics committee and legal. The escalation matrix must map each incident type to the right response team without introducing delays caused by misrouting.
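The routing logic described above can be made explicit as a small lookup, so a misclassified or unrecognized incident type still lands somewhere instead of being dropped. The team names here are hypothetical placeholders for your own on-call rotations:

```python
# Hypothetical routing table mapping the triage
# classification to the owning response team.
ESCALATION_ROUTES = {
    "infrastructure": "platform-oncall",
    "model_quality": "ml-engineering-oncall",
    "safety": "ethics-and-legal",
    "compliance": "ethics-and-legal",
}


def route(incident_type: str) -> str:
    """Return the owning team for an incident type.

    Unknown classifications default to the platform team
    so an alert is never silently unrouted."""
    return ESCALATION_ROUTES.get(incident_type, "platform-oncall")
```

The default route is a design choice: misrouting to a team that can re-triage is cheaper than an alert that pages no one.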
Step 1: Triage (0-15 minutes)
On-call engineer receives alert. Classify as infrastructure issue vs. model quality issue vs. safety/compliance issue. If infrastructure, follow standard incident process. If model quality or safety, proceed to Step 2.
Step 2: Containment (15-60 minutes)
Assess blast radius. If model is producing harmful outputs, immediately roll back to the last known good version or activate the fallback model. If accuracy has degraded but outputs are not harmful, set up shadow monitoring and proceed to investigation.
Step 3: Investigation (1-4 hours)
ML engineer investigates root cause. Check for data pipeline changes, feature drift, upstream data quality issues, model version mismatches, and infrastructure changes. Correlate incident timing with recent deployments.
Step 4: Remediation (hours to days)
Implement fix based on root cause. If data issue: fix pipeline and retrain. If model drift: retrain on updated data. If adversarial attack: patch input validation and deploy hardened model. If infrastructure: restore and add monitoring.
Step 5: Post-Incident Review (within 5 business days)
Conduct blameless post-incident review. Document timeline, root cause, impact assessment, and prevention actions. Update detection rules and runbooks based on learnings. Share findings with the broader engineering organization.
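The containment decision in Step 2 can be sketched as a single routine: harmful outputs trigger an immediate rollback, while non-harmful accuracy degradation gets shadow monitoring. The `model_registry` and `serving` interfaces below are assumptions for illustration, not part of any named platform:

```python
def contain(model_registry, serving, harmful: bool) -> str:
    """Step 2 containment sketch (assumed interfaces).

    Harmful outputs: roll back to the last known good
    model version immediately. Accuracy degradation only:
    keep serving but enable shadow monitoring for the
    investigation in Step 3."""
    if harmful:
        last_good = model_registry.last_known_good()
        serving.rollback(last_good)
        return f"rolled back to {last_good}"
    serving.enable_shadow_monitoring()
    return "shadow monitoring enabled"
```

Encoding the decision this way forces the team to define "last known good" ahead of time, which is exactly the model versioning prerequisite listed at the top of this playbook.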
Root Cause Analysis for AI
Root cause analysis for AI incidents must investigate dimensions that traditional RCA does not cover. Beyond the standard infrastructure and code change analysis, AI RCA must examine data pipeline changes (did the distribution of input features shift?), upstream system changes (did a dependent service change its output format or behavior?), model-data interaction effects (did a rare data pattern trigger an edge case?), and temporal effects (did seasonality or a real-world event change the relationship between features and outcomes?).
The most insidious AI incidents are caused by gradual data drift rather than sudden failures. These are often missed by RCA processes designed for point-in-time failure analysis. Always include a data distribution comparison between the incident window and the baseline period as a standard RCA step.
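The distribution comparison called for above can be sketched with a Population Stability Index, a common drift statistic. This pure-Python version is illustrative (production systems usually lean on a monitoring library), and the ~0.2 significance threshold is an industry rule of thumb, not a value from this playbook:

```python
import math


def population_stability_index(baseline, incident, bins=10):
    """PSI between a baseline and an incident-window sample
    of one numeric feature. Values above ~0.2 are commonly
    read as significant drift (rule of thumb)."""
    lo = min(min(baseline), min(incident))
    hi = max(max(baseline), max(incident))
    width = (hi - lo) / bins or 1.0  # guard against all-equal data

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # Smooth empty bins to avoid log(0) and division by zero.
        return [(c + 0.5) / (len(values) + 0.5 * bins) for c in counts]

    p, q = histogram(baseline), histogram(incident)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Running this per feature between the incident window and the baseline period turns "did the data shift?" from a judgment call into a ranked list of drifted features.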
Communication Templates
AI incidents require carefully crafted communication because stakeholders often do not understand the probabilistic nature of model quality. Saying 'model accuracy dropped from 92% to 85%' is technically accurate but unhelpful to a product manager who needs to know whether users were affected. Communication templates should translate technical metrics into business impact language: how many users were affected, what decisions may have been impacted, and what the remediation timeline is.
## AI Incident Report: [INCIDENT-ID]
**Severity**: SEV-[1/2/3/4]
**Status**: [Investigating / Contained / Resolved]
**Duration**: [Start time] - [End time or ongoing]
**Affected System**: [Model name and version]
### Impact Summary
- **Users affected**: [Number or percentage]
- **Decisions impacted**: [Description of affected outputs]
- **Business impact**: [Revenue, user experience, compliance]
### Timeline
- [HH:MM] Alert triggered by [detection mechanism]
- [HH:MM] On-call engineer acknowledged
- [HH:MM] [Containment action taken]
- [HH:MM] Root cause identified
- [HH:MM] Fix deployed / Model rolled back
### Root Cause
[One-paragraph description of what went wrong and why]
### Remediation
- **Immediate**: [Actions already taken]
- **Short-term**: [Actions planned this week]
- **Long-term**: [Systemic improvements planned]
### Prevention
- [ ] [Monitoring improvement]
- [ ] [Testing gap closure]
- [ ] [Process change]
Version History
1.0.0 · 2026-03-01
- Initial release with AI-specific severity classification system
- Detection framework code example with extensible detector registration
- Five-step escalation and response process
- Root cause analysis guidance for data drift and model quality incidents
- Communication templates for stakeholder updates
- Incident response readiness checklist