Key Takeaway
Model serving architecture decisions have long-term operational implications, so documenting auto-scaling parameters and failover behavior upfront prevents production surprises.
When to Use This Template
Use this ADR when deploying ML models to production for the first time, re-architecting an existing serving layer, or adding new models to an existing serving infrastructure. Model serving decisions affect latency, cost, reliability, and deployment velocity. This template helps teams compare managed, self-hosted, and serverless options with a structured evaluation.
ADR Template
# ADR: Model Serving Architecture
## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-XXX]
## Date
YYYY-MM-DD
## Decision Makers
- [Name, Role]
## Context
### Models to Serve
| Model | Type | Size | Latency SLA | Traffic Pattern |
|-------|------|------|-------------|-----------------|
| [Model 1] | [LLM/Classifier/Embedding] | [params or API] | [p99 target] | [steady/bursty] |
| [Model 2] | | | | |
### Requirements
- Deployment frequency: [e.g., "weekly model updates"]
- A/B testing: [required for model comparison | not needed]
- Multi-model routing: [serve multiple model versions simultaneously | single version]
- Auto-scaling: [scale to zero acceptable | minimum instance count required]
- GPU requirements: [GPU type and count, or API-only]
## Options Considered
| Criterion | Managed Endpoint | Custom Container | Serverless | API Gateway |
|-----------|-----------------|------------------|------------|-------------|
| Cold start latency | None | None | High | None |
| Scaling speed | Medium | Self-managed | Fast | N/A |
| Cost at low traffic | High (min instances) | High | Low | Per-request |
| Cost at high traffic | Moderate | Low | High | Per-request |
| GPU access | Managed | Full control | Limited | N/A |
| Deployment speed | Fast | Medium | Fast | N/A |
| Monitoring | Built-in | Custom | Limited | Provider |
| A/B testing | Some support | Custom | Limited | Easy |
## Decision
We will use [architecture] because [rationale].
### Configuration
- Instance type: [e.g., "g5.xlarge" or "API-based"]
- Auto-scaling policy: [min/max instances, scaling metric, cooldown]
- Health check: [endpoint, interval, failure threshold]
- Deployment strategy: [blue-green | canary | rolling]
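To make the health-check parameters above concrete: most load balancers use consecutive-failure semantics, where an instance is removed only after N failed checks in a row. A minimal sketch of that logic (the function name and three-failure default are illustrative, not tied to any particular platform):

```python
def should_mark_unhealthy(check_results, failure_threshold=3):
    """Return True once `failure_threshold` consecutive health
    checks have failed; a single success resets the streak."""
    streak = 0
    for passed in check_results:
        streak = 0 if passed else streak + 1
        if streak >= failure_threshold:
            return True
    return False
```

Confirm your provider's exact semantics before relying on this: some count failures within a window rather than consecutively, which changes how flaky instances behave.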
## Consequences
- Monthly cost estimate: [$X at projected traffic]
- Operational requirements: [monitoring, on-call, maintenance]
- Deployment velocity: [time from model ready to production]
- Limitations: [known scaling limits, feature gaps]
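The monthly cost estimate above is easiest to defend with a back-of-envelope model. This sketch compares an always-on fleet against pure per-request serverless pricing; the rates in the usage note are placeholders, not real provider prices:

```python
def fleet_monthly_cost(instances, hourly_rate, hours_per_month=730):
    """Cost of always-on instances (730 ~= hours in an average month)."""
    return instances * hourly_rate * hours_per_month

def serverless_monthly_cost(monthly_requests, price_per_request):
    """Pure pay-per-request cost with no idle charge."""
    return monthly_requests * price_per_request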
## Review Trigger
- [ ] Monthly traffic exceeds [threshold] requests
- [ ] Model size increases beyond current instance capacity
- [ ] Latency SLA breach rate exceeds [threshold]%
- [ ] Monthly serving cost exceeds [threshold]Section-by-Section Guidance
Multi-Model Considerations
If you serve multiple models, document whether they share infrastructure or each have dedicated resources. Shared serving (e.g., a multi-model endpoint) reduces cost but creates resource contention risk. Dedicated serving provides isolation but increases infrastructure cost and management overhead. For most teams, start with dedicated serving for production-critical models and shared serving for lower-priority models.
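The split described above can be captured mechanically in the ADR. This sketch (all names illustrative) assigns each model to a dedicated or shared pool based on a criticality flag:

```python
def plan_serving_tiers(models):
    """models: iterable of (name, is_production_critical) pairs.
    Critical models get dedicated endpoints; the rest share one."""
    plan = {"dedicated": [], "shared": []}
    for name, critical in models:
        plan["dedicated" if critical else "shared"].append(name)
    return plan
```

Recording the output of a function like this in the ADR makes the tier assignment auditable when new models are added later.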
Auto-Scaling Configuration
Document auto-scaling parameters explicitly: the metric that triggers scaling (request count, latency, GPU utilization), the scaling step size, and the cooldown period. Under-documented auto-scaling is a common source of serving incidents. Also document the behavior at scale limits: what happens when maximum instances are reached and traffic continues to grow? This failure mode should have an explicit plan.
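One widely used policy shape is target tracking: grow the fleet in proportion to how far the chosen metric is from its target. This sketch is a simplification of what managed autoscalers do (parameter names are illustrative), and it makes the at-max clamp explicit, since that clamp is exactly where the failure plan from the paragraph above takes over:

```python
import math

def desired_instances(current, metric_value, metric_target, min_inst, max_inst):
    """Target tracking: if the metric runs at 1.5x target, grow the
    fleet ~1.5x. Clamped to [min_inst, max_inst]; at max_inst, excess
    traffic must queue, shed, or overflow elsewhere -- document which."""
    if metric_value <= 0:
        return min_inst
    raw = math.ceil(current * metric_value / metric_target)
    return max(min_inst, min(max_inst, raw))
```

For example, a 4-instance fleet running at 150 requests/instance against a target of 100 scales to 6 instances; the same fleet at 8 instances and double its target hits the 10-instance ceiling and stops growing.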
If your traffic is bursty, consider a hybrid approach: maintain a baseline of always-on instances for steady traffic and use serverless or auto-scaling for burst capacity. This balances cost efficiency with latency consistency.
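For sizing the always-on baseline in that hybrid, a useful number is the break-even request volume: the monthly traffic above which one more always-on instance is cheaper than serving the same requests serverlessly. A sketch, with placeholder prices in the usage note:

```python
def breakeven_monthly_requests(hourly_rate, price_per_request, hours_per_month=730):
    """Requests/month at which one always-on instance costs the same
    as serverless; above this, add the instance to the baseline."""
    return hourly_rate * hours_per_month / price_per_request
```

At an illustrative $1/hour instance and $0.0001/request serverless price, break-even lands around 7.3 million requests per month; traffic reliably above that level belongs on the baseline fleet.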
Serverless inference with scale-to-zero sounds cost-effective but introduces cold start latency that can violate latency SLAs. Test cold start behavior under realistic conditions before committing to a serverless serving architecture for latency-sensitive workloads.
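Testing cold start behavior can be as simple as timing the first request after the endpoint has scaled to zero and comparing it with warm requests. A minimal harness, assuming you supply your own `predict` callable (the name is illustrative):

```python
import time

def cold_vs_warm_latency(predict, warm_calls=5):
    """Time the first (potentially cold) call, then return it alongside
    the median of several warm calls for comparison against the SLA."""
    start = time.perf_counter()
    predict()
    cold = time.perf_counter() - start
    warm = []
    for _ in range(warm_calls):
        start = time.perf_counter()
        predict()
        warm.append(time.perf_counter() - start)
    warm.sort()
    return cold, warm[len(warm) // 2]
```

Run this after forcing a genuine scale-to-zero (for example, by waiting out the idle timeout), not against a freshly warm endpoint, or the cold measurement will be misleadingly fast.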
Version History
1.0.0 · 2026-03-01
- Initial ADR template for model serving architecture