Key Takeaway
Deployment target decisions should be revisited as model sizes, traffic patterns, and cloud pricing evolve, so building in a review trigger is essential.
When to Use This Template
Use this ADR when deploying a new AI model to production, re-evaluating deployment strategy for existing models, or assessing whether to move workloads between cloud, edge, and on-premises environments. This decision has significant cost and latency implications that compound over time, making it worth documenting rigorously even when the initial choice seems obvious.
ADR Template
# ADR: Deployment Target for AI Models
## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-XXX]
## Date
YYYY-MM-DD
## Decision Makers
- [Name, Role]
## Context
### Workload Characteristics
- Model type: [e.g., "LLM inference via API", "custom classifier", "embedding model"]
- Model size: [e.g., "API-based (provider-managed)", "7B parameters", "300M parameters"]
- Inference pattern: [synchronous request/response | batch | streaming]
- Traffic profile: [steady | bursty | time-of-day patterns]
### Requirements
- Latency SLA: [e.g., "p99 < 500ms for classification, p99 < 5s for generation"]
- Throughput: [e.g., "200 requests/minute sustained, 1000/minute peak"]
- Data residency: [e.g., "EU data must stay in EU", "no restrictions"]
- Availability SLA: [e.g., "99.9% uptime"]
- Compliance: [e.g., "HIPAA", "SOC 2", "FedRAMP"]
### Team Capabilities
- Infrastructure experience: [managed cloud only | Kubernetes | bare metal]
- On-call capacity: [dedicated infra on-call | shared rotation | none]
- GPU expertise: [none | basic | production experience]
## Options Considered
| Criterion | Cloud API | Self-Hosted Cloud | Edge | On-Premises |
|-----------|-----------|-------------------|------|-------------|
| Latency (typical) | | | | |
| Cost model | | | | |
| Data control | | | | |
| Scaling flexibility | | | | |
| Operational burden | | | | |
| GPU management | | | | |
| Cold start behavior | | | | |
| Compliance fit | | | | |
## Evaluation Criteria (Weighted)
1. **Latency at scale** (25%) — Meeting SLA under peak load
2. **Total cost of ownership** (25%) — Compute + operations + team time
3. **Data control** (20%) — Residency, privacy, audit requirements
4. **Operational simplicity** (15%) — Deployment, scaling, monitoring
5. **Flexibility** (15%) — Ability to adapt as requirements change
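The weighted criteria above can be reduced to a simple scoring matrix when comparing options. A minimal sketch in Python; the per-option scores below are placeholder illustrations, not recommendations:

```python
# Weighted scoring for deployment target options.
# Weights mirror the evaluation criteria above; per-option scores
# (1-5 scale) are placeholders to illustrate the arithmetic.
WEIGHTS = {
    "latency_at_scale": 0.25,
    "total_cost_of_ownership": 0.25,
    "data_control": 0.20,
    "operational_simplicity": 0.15,
    "flexibility": 0.15,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Return the weight-adjusted total for one deployment option."""
    assert set(scores) == set(WEIGHTS), "score every criterion exactly once"
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

# Hypothetical scores for a cloud-API option.
cloud_api = {
    "latency_at_scale": 3,
    "total_cost_of_ownership": 4,
    "data_control": 2,
    "operational_simplicity": 5,
    "flexibility": 3,
}
print(f"Cloud API: {weighted_score(cloud_api):.2f}")
```

The score itself matters less than the conversation it forces: if two options land within a few tenths of each other, the decision should rest on the qualitative consequences section, not the matrix.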
## Decision
We will deploy to [target] because [rationale].
## Consequences
- Infrastructure commitment: [e.g., "reserved GPU instances for 12 months"]
- Team requirements: [e.g., "need to hire or upskill for Kubernetes GPU management"]
- Cost projection: [monthly cost estimate with assumptions]
- Migration path: [how to move to a different target if needed]
## Review Trigger
- [ ] Monthly traffic exceeds [threshold] requests
- [ ] Monthly cost exceeds [threshold]
- [ ] Latency SLA breach rate exceeds [threshold]
- [ ] New compliance requirement introduced
- [ ] Contract renewal for cloud provider approaches

Section-by-Section Guidance
Team Capabilities
This is the most frequently underweighted section in deployment target ADRs. A team with no GPU management experience choosing self-hosted deployment will face months of operational learning that delays the project. Be honest about your team's current capabilities and factor the ramp-up time into your timeline. If the decision requires capabilities your team does not have, the ADR should explicitly state whether you plan to hire, train, or use managed services to close that gap.
Cost Modeling
Cloud API pricing is straightforward to model but can grow surprisingly expensive at scale. Self-hosted costs include GPU instances, storage, networking, and the engineering time to maintain the infrastructure. Many teams underestimate the operational cost by focusing only on compute. For a fair comparison, include the fully loaded cost of engineer hours spent on infrastructure maintenance, monitoring, and incident response.
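A fully loaded comparison can be sketched as a back-of-envelope model. Every figure below is an illustrative assumption; substitute your own provider quotes, salary data, and traffic numbers:

```python
# Back-of-envelope monthly TCO comparison: cloud API vs self-hosted.
# All figures are illustrative assumptions, not real prices.

def cloud_api_monthly(requests: int, cost_per_1k: float) -> float:
    """Usage-based API cost: scales linearly with traffic."""
    return requests / 1000 * cost_per_1k

def self_hosted_monthly(gpu_instances: int, gpu_hourly_rate: float,
                        eng_hours: float, loaded_hourly_rate: float,
                        overhead: float = 0.15) -> float:
    """GPU rental plus fully loaded engineering time, with a
    storage/networking overhead applied on top of compute."""
    compute = gpu_instances * gpu_hourly_rate * 730  # ~hours per month
    ops = eng_hours * loaded_hourly_rate
    return compute * (1 + overhead) + ops

# Hypothetical inputs: 5M requests/month at $0.50 per 1k calls,
# vs two GPU instances at $2.50/hr plus 40 engineer-hours at $120/hr.
api = cloud_api_monthly(5_000_000, 0.50)
hosted = self_hosted_monthly(2, 2.50, 40, 120)
print(f"API: ${api:,.0f}/mo, self-hosted: ${hosted:,.0f}/mo")
```

Note how the engineering line item dominates the self-hosted total at this scale; that is the term teams most often omit.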
Start with cloud APIs and set a cost threshold review trigger. This gets you to production fastest and gives you real traffic data to inform a self-hosting decision later. The cost of premature optimization (delayed launch, operational burden) almost always exceeds the savings from lower per-inference cost.
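The cost-threshold trigger can be as simple as a scheduled check against billing data. A minimal sketch, assuming a `get_monthly_spend` hook into your provider's billing API and an `alert` callback (both hypothetical stand-ins):

```python
# Minimal review-trigger check: flag when monthly spend crosses the
# threshold recorded in the ADR. The billing lookup and alert sink
# are hypothetical stand-ins for your own billing API and pager/chat.

REVIEW_THRESHOLD_USD = 10_000  # from the ADR's "Review Trigger" section

def should_trigger_review(monthly_spend_usd: float,
                          threshold_usd: float = REVIEW_THRESHOLD_USD) -> bool:
    """Return True when spend crosses the documented threshold."""
    return monthly_spend_usd >= threshold_usd

def check_and_alert(get_monthly_spend, alert) -> None:
    """Run from a scheduled job (cron, CI, etc.)."""
    spend = get_monthly_spend()
    if should_trigger_review(spend):
        alert(f"Monthly spend ${spend:,.0f} crossed review threshold "
              f"${REVIEW_THRESHOLD_USD:,}; revisit the deployment ADR.")
```

Wiring the trigger into automation keeps the ADR honest: the review happens because a threshold fired, not because someone remembered to look.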
Edge deployment adds significant complexity around model updates, monitoring, and debugging. Only choose edge when latency or data residency requirements make it strictly necessary, not because it seems technically interesting.
Version History
1.0.0 · 2026-03-01
- Initial ADR template for deployment target selection