Key Takeaway
AI monitoring ADRs should document both infrastructure metrics and model quality metrics, with explicit thresholds and escalation paths for each alert type.
When to Use This Template
Use this ADR when setting up monitoring for a new AI system, consolidating ad-hoc monitoring into a coherent strategy, or migrating between observability platforms. AI systems require monitoring dimensions that standard application monitoring does not cover: model quality drift, token consumption patterns, prompt performance, and safety filter activation rates. This template ensures all monitoring dimensions are addressed systematically.
ADR Template
# ADR: AI System Monitoring and Alerting Strategy
## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-XXX]
## Date
YYYY-MM-DD
## Decision Makers
- [Name, Role]
## Context
### System Components
- AI models in production: [list with traffic volume each]
- Infrastructure: [cloud provider, compute type, serving layer]
- On-call structure: [dedicated AI on-call | shared engineering rotation]
### Current Monitoring Gaps
- [e.g., "No model quality monitoring — only infrastructure metrics"]
- [e.g., "Alert fatigue from poorly calibrated thresholds"]
## Metric Taxonomy
### Infrastructure Metrics
| Metric | Source | Alert Threshold | Escalation |
|--------|--------|-----------------|------------|
| Request latency (p50, p95, p99) | [source] | [threshold] | [path] |
| Error rate | [source] | [threshold] | [path] |
| GPU/CPU utilization | [source] | [threshold] | [path] |
| Memory usage | [source] | [threshold] | [path] |
| Request queue depth | [source] | [threshold] | [path] |
### Model Quality Metrics
| Metric | Source | Alert Threshold | Escalation |
|--------|--------|-----------------|------------|
| Output quality score | [evaluation pipeline] | [threshold] | [path] |
| Safety filter activation rate | [safety layer] | [threshold] | [path] |
| Token usage per request | [usage tracking] | [threshold] | [path] |
| Cache hit rate | [cache layer] | [threshold] | [path] |
| User feedback signals | [feedback system] | [threshold] | [path] |
### Business Metrics
| Metric | Source | Alert Threshold | Escalation |
|--------|--------|-----------------|------------|
| Daily inference cost | [billing] | [threshold] | [path] |
| Feature adoption rate | [analytics] | [threshold] | [path] |
| User satisfaction score | [surveys/feedback] | [threshold] | [path] |
## Options Considered
| Criterion | Platform A | Platform B | Custom Build |
|-----------|-----------|-----------|--------------|
| AI-specific metrics support | | | |
| Integration with serving layer | | | |
| Alert routing and escalation | | | |
| Dashboard customization | | | |
| Cost at projected metric volume | | | |
| Team familiarity | | | |
## Decision
We will use [platform/approach] because [rationale].
### Dashboard Design
- Engineering dashboard: [real-time operational metrics]
- Leadership dashboard: [weekly/monthly business metrics]
- Incident dashboard: [focused view for on-call troubleshooting]
## Consequences
- Instrumentation effort: [estimated engineering time to add metrics]
- Alert tuning period: [expected time to calibrate thresholds]
- Ongoing cost: [monthly monitoring platform cost]
- Training: [team onboarding for new tools]
## Review Trigger
- [ ] False positive alert rate exceeds [threshold]%
- [ ] Missed incident (alert should have fired but didn't)
- [ ] Monthly monitoring cost exceeds [threshold]
- [ ] New AI system component requires monitoring
Section-by-Section Guidance
Metric Taxonomy
The three-tier metric taxonomy (infrastructure, model quality, business) ensures monitoring coverage across all dimensions. Infrastructure metrics catch operational issues. Model quality metrics catch degradation that infrastructure metrics miss, such as a model returning valid HTTP responses with poor content quality. Business metrics catch issues that neither operational nor quality metrics surface, such as declining user engagement despite stable technical metrics.
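The taxonomy above can be encoded as data so that alert routing is driven by configuration rather than scattered through dashboards. The sketch below is a minimal illustration, not part of the template; the metric names, sources, and escalation targets are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricSpec:
    name: str
    tier: str        # "infrastructure" | "model_quality" | "business"
    source: str
    threshold: float
    escalation: str  # on-call rotation or channel to page

# Hypothetical entries, one per tier, mirroring the template tables.
TAXONOMY = [
    MetricSpec("p99_latency_ms", "infrastructure", "serving layer", 1500.0, "ai-oncall"),
    MetricSpec("output_quality_score", "model_quality", "eval pipeline", 0.75, "ml-team"),
    MetricSpec("daily_inference_cost_usd", "business", "billing", 500.0, "eng-leads"),
]

def specs_for_tier(tier: str) -> list[MetricSpec]:
    """Return every metric spec registered under one taxonomy tier."""
    return [m for m in TAXONOMY if m.tier == tier]
```

Keeping the taxonomy in one structure also makes coverage gaps visible: a tier with an empty list is a monitoring dimension the system is not covering.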
Alert Threshold Calibration
Set initial alert thresholds from baseline measurements covering at least two weeks of production data. Avoid overly tight thresholds at first: they create alert fatigue, and a team that learns to ignore noisy alerts will also ignore real ones. Start with wider thresholds and tighten them as you learn each metric's normal variance. Document the expected false positive rate for each alert and review it monthly.
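A simple way to derive a baseline threshold is mean plus a generous number of standard deviations, tightened later. This is a minimal sketch of that approach (the sample values and the three-sigma starting point are illustrative assumptions, not prescriptions):

```python
import statistics

def calibrate_threshold(baseline_values: list[float], sigmas: float = 3.0) -> float:
    """Derive an initial alert threshold from baseline measurements.

    baseline_values: metric samples from at least two weeks of production data.
    sigmas: standard deviations above the mean at which to alert;
            start wide (3 or more) and tighten as normal variance is understood.
    """
    mean = statistics.fmean(baseline_values)
    stdev = statistics.stdev(baseline_values)
    return mean + sigmas * stdev

# Illustrative p95 latency samples (ms) collected during the baseline period.
p95_samples = [820, 790, 845, 810, 905, 780, 835, 860, 815, 795]
threshold = calibrate_threshold(p95_samples, sigmas=3.0)
```

Note that a three-sigma threshold sits above every baseline sample, so it only fires on genuinely anomalous latency; lowering `sigmas` later is the "tightening" step the guidance describes.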
Dashboard Design
Create separate dashboards for different audiences. Engineers need real-time operational metrics with granular drill-down. Leadership needs weekly trends with business impact context. On-call needs a focused incident view with runbook links. A single dashboard trying to serve all audiences serves none of them well.
Continuous Quality Evaluation
Model quality monitoring requires a continuous evaluation pipeline, not just logging. Logging raw outputs is necessary for debugging but insufficient for detecting quality drift. Implement automated quality scoring on a sample of production responses and alert when scores trend downward.
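The sample-then-trend pattern described above can be sketched as two small functions: one that samples production responses for scoring, and one that compares a recent window of quality scores against the preceding window. The sampling rate, window size, and drop tolerance here are illustrative assumptions to be tuned per system.

```python
import random
import statistics

def sample_for_scoring(responses: list, rate: float = 0.05, seed=None) -> list:
    """Randomly sample a fraction of production responses for automated scoring."""
    rng = random.Random(seed)
    return [r for r in responses if rng.random() < rate]

def quality_trending_down(scores: list[float], window: int = 50, drop: float = 0.05) -> bool:
    """Alert when the mean of the most recent `window` scores falls more than
    `drop` (absolute score units) below the mean of the window before it."""
    if len(scores) < 2 * window:
        return False  # not enough history to compare two windows
    baseline = statistics.fmean(scores[-2 * window:-window])
    recent = statistics.fmean(scores[-window:])
    return baseline - recent > drop
```

Comparing adjacent windows rather than alerting on individual low scores keeps the alert robust to one-off bad responses while still catching sustained drift.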
Version History
1.0.0 · 2026-03-01
- Initial ADR template for monitoring and alerting strategy