Key Takeaway
Documenting the caching approach in an ADR gives the team a shared understanding of cache invalidation rules and of the quality trade-offs accepted in exchange for cost savings.
When to Use This Template
Use this ADR when designing a caching layer for LLM inference, optimizing an existing caching strategy, or evaluating whether caching is appropriate for your AI workload. Caching can reduce inference costs significantly, but it introduces trade-offs around response freshness, personalization, and quality. This template helps the team make those trade-offs explicit and keep them documented.
ADR Template
# ADR: AI Inference Caching Approach
## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-XXX]
## Date
YYYY-MM-DD
## Decision Makers
- [Name, Role]
## Context
### Workload Profile
- Request volume: [e.g., "10,000 requests/day"]
- Query diversity: [high (mostly unique) | medium (some repetition) | low (highly repetitive)]
- Personalization requirement: [none | user-context dependent | highly personalized]
- Freshness requirement: [real-time | hourly | daily | static knowledge only]
- Current monthly inference cost: [e.g., "$5,000/month"]
### Quality Constraints
- Acceptable quality degradation from caching: [none | minimal | moderate]
- Determinism requirement: [same input must produce same output | variation acceptable]
- Context sensitivity: [responses depend on user context | responses are context-free]
## Options Considered
| Criterion | No Cache | Exact Match | Semantic Cache | Hierarchical |
|-----------|----------|-------------|----------------|--------------|
| Expected hit rate | 0% | [estimate]% | [estimate]% | [estimate]% |
| Quality impact | None | None | [risk level] | [risk level] |
| Implementation effort | None | Low | Medium | High |
| Storage cost | None | Low | Medium | Medium |
| Invalidation complexity | N/A | Low | High | High |
| Projected cost savings | $0 | $[estimate] | $[estimate] | $[estimate] |
## Caching Strategy Details
### Exact Match Cache
- Cache key: [e.g., "hash of system prompt + user message + model + temperature"]
- TTL: [e.g., "24 hours"]
- Storage: [e.g., "Redis with LRU eviction"]
- Max cache size: [e.g., "10GB"]
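The cache-key bullet above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: it assumes the key fields named in the template (system prompt, user message, model, temperature) and uses SHA-256 over a canonical JSON serialization so the key is deterministic and fixed-length, which suits a Redis key.

```python
import hashlib
import json

def exact_cache_key(system_prompt: str, user_message: str,
                    model: str, temperature: float) -> str:
    """Build a deterministic cache key from the request fields.

    JSON with sorted keys yields a stable byte string; SHA-256 keeps
    the key fixed-length regardless of prompt size.
    """
    payload = json.dumps(
        {
            "system": system_prompt,
            "user": user_message,
            "model": model,
            "temperature": temperature,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any field that changes the model's output (top_p, max tokens, tool definitions) belongs in the key as well; omitting one silently merges distinct requests into the same cache entry.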
### Semantic Cache (if selected)
- Similarity threshold: [e.g., "cosine similarity > 0.95"]
- Embedding model for cache keys: [model name]
- Quality validation: [how to verify cached response quality]
### Invalidation Rules
- Time-based: [TTL for different query types]
- Event-based: [what events trigger cache invalidation]
- Manual: [process for forcing cache clear]
## Decision
We will implement [caching approach] because [rationale].
## Consequences
- Expected cost reduction: [projected monthly savings]
- Quality trade-off: [explicit statement of accepted quality impact]
- Monitoring requirements: [cache hit rate, quality metrics, staleness tracking]
- Storage commitment: [projected storage growth]
## Review Trigger
- [ ] Cache hit rate drops below [threshold]%
- [ ] Quality complaints linked to cached responses exceed [threshold]
- [ ] Monthly inference cost exceeds [threshold] despite caching
- [ ] Query pattern shifts significantly (measured by hit rate change)

Section-by-Section Guidance
Query Diversity Analysis
Before investing in caching infrastructure, measure your actual query diversity. Sample a week of production queries and analyze how many are exact or near duplicates. If fewer than 10% of queries repeat, exact-match caching will have minimal impact and semantic caching becomes the primary option. If more than 30% repeat, exact-match caching alone can deliver substantial savings with minimal complexity.
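The repeat-rate measurement above can be done with a short script over sampled query logs. This is a sketch, assuming queries arrive as plain strings; the light normalization (lowercasing, collapsing whitespace) is a hypothetical choice that catches trivial near-duplicates without an embedding model.

```python
from collections import Counter

def repeat_rate(queries: list[str]) -> float:
    """Fraction of queries that repeat an earlier query, after light
    normalization (lowercase, collapsed whitespace)."""
    normalized = [" ".join(q.lower().split()) for q in queries]
    counts = Counter(normalized)
    # Each distinct query contributes (occurrences - 1) repeats.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(normalized) if normalized else 0.0
```

Run this over a week of production queries: a result below 0.10 argues against exact-match caching, above 0.30 argues for it, per the thresholds above.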
Semantic Caching Risks
Semantic caching introduces a quality risk that exact-match caching avoids. Two queries that are semantically similar may require different responses based on subtle differences in wording, context, or intent. Set the similarity threshold conservatively (0.95 or higher) and implement a quality monitoring pipeline that samples cached responses for human review. Lower the threshold only after you have evidence that quality is maintained.
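The conservative-threshold rule above can be sketched as a lookup that only serves a cached response when similarity clears the bar. This is illustrative, assuming embeddings are plain float vectors; `semantic_lookup` and its cache shape are hypothetical names, and the 0.95 default mirrors the threshold recommended above.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_lookup(query_emb, cache, threshold=0.95):
    """Return the cached response most similar to the query, but only
    if its similarity clears the conservative threshold; else miss."""
    best, best_sim = None, 0.0
    for emb, response in cache:
        sim = cosine(query_emb, emb)
        if sim > best_sim:
            best, best_sim = response, sim
    return best if best_sim >= threshold else None  # None = cache miss
```

Logging `best_sim` on every lookup, including misses, gives you the distribution you need before deciding whether the threshold can safely be lowered.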
Implement caching as a read-through layer that can be disabled with a feature flag. This lets you measure the quality impact by A/B testing cached vs. fresh responses and provides an immediate rollback mechanism if quality degrades.
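A minimal read-through layer with a kill switch might look like the following. This is a sketch under stated assumptions: `call_model` stands in for your inference client, the cache is a plain dict for illustration (Redis in practice), and `caching_enabled` is the feature flag described above.

```python
def get_response(prompt: str, cache: dict, call_model,
                 caching_enabled: bool = True) -> str:
    """Read-through cache: serve from cache on hit, otherwise call the
    model and populate the cache. Disabling the flag bypasses the cache
    entirely, giving an instant rollback path and a clean control arm
    for A/B-testing cached vs. fresh responses."""
    if caching_enabled and prompt in cache:
        return cache[prompt]
    response = call_model(prompt)
    if caching_enabled:
        cache[prompt] = response
    return response
```

Because all traffic flows through one function, flipping the flag changes behavior without redeploying the callers.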
Never cache responses that depend on real-time data, user-specific context, or safety-critical decisions without explicit invalidation rules for each dependency. Stale cached responses in these categories can cause user-visible errors or safety incidents.
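One way to make the per-dependency invalidation rule above enforceable is to tag each cache entry with what it depends on, so an event can evict exactly the affected entries. The `TaggedCache` class and its tag names are hypothetical, a minimal in-memory sketch of the idea rather than a production design.

```python
from collections import defaultdict

class TaggedCache:
    """Cache whose entries carry dependency tags (e.g. "pricing_table",
    "user:42"); an invalidation event evicts every entry tagged with it."""

    def __init__(self):
        self.entries = {}               # key -> cached response
        self.by_tag = defaultdict(set)  # tag -> keys depending on it

    def put(self, key, response, tags):
        self.entries[key] = response
        for tag in tags:
            self.by_tag[tag].add(key)

    def get(self, key):
        return self.entries.get(key)    # None on miss or after eviction

    def invalidate(self, tag):
        # Evict every entry that declared a dependency on this tag.
        for key in self.by_tag.pop(tag, set()):
            self.entries.pop(key, None)
```

An entry that cannot name its dependencies as tags is exactly the kind of entry the guidance above says should not be cached at all.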
Version History
1.0.0 · 2026-03-01
- Initial ADR template for caching approach