Key Takeaway
AI cost optimization is not about spending less -- it is about spending deliberately. The highest-impact lever for most teams is matching model capability to task complexity: using your most capable model only where it matters and routing everything else to faster, cheaper alternatives.
Prerequisites
- At least one AI workload running in production with observable cost data
- Access to billing dashboards for your cloud provider and/or LLM API provider
- Basic understanding of token-based pricing for LLM APIs
- Familiarity with your application's query patterns and traffic volumes
- A cost tracking system or the ability to implement one (even a spreadsheet to start)
The AI Cost Problem
AI workloads follow a different cost curve than traditional software. Traditional SaaS applications scale costs roughly linearly with users: more users mean more compute, storage, and bandwidth, but the cost per user stays relatively stable. AI workloads break this model. Every inference call has a non-trivial marginal cost, and that cost varies dramatically based on the model, the prompt length, and the response complexity. A single feature powered by a frontier LLM can cost more per API call than your entire application server costs per request.
The challenge is compounded by the fact that AI costs are often opaque until the bill arrives. Engineering teams build features using the most capable model available during development, hard-code prompt templates that are longer than necessary, and skip caching because the traffic is low in staging. Then the feature launches, traffic scales, and the monthly bill becomes a conversation topic in the executive team meeting.
- LLM API Calls (40-60%): The largest cost driver for most AI applications. Prompt and completion tokens at frontier model prices dominate the bill.
- Compute & GPU (15-25%): GPU instances for self-hosted models, fine-tuning jobs, and embedding generation. Often over-provisioned.
- Storage & Embeddings (10-20%): Vector databases, model artifact storage, training data, and embedding indices.
- Monitoring & Tooling (5-15%): Observability platforms, experiment tracking, evaluation pipelines, and MLOps infrastructure.
Cost Anatomy: Where the Money Goes
Before you can optimize costs, you need to understand where they accumulate. The breakdown above shows the primary cost centers in a typical AI application stack. Most teams discover that one or two cost centers dominate their bill, and targeted optimization of those centers yields better results than trying to optimize everything at once.
Token Analysis & Optimization
For applications that rely on LLM API calls, token usage is the single largest cost driver. Every token in your prompt and every token in the model's response costs money. The good news is that most applications send far more tokens than necessary. Verbose system prompts, redundant context, unoptimized few-shot examples, and unbounded response lengths all contribute to inflated token counts. Optimizing token usage requires measuring it first.
Token Counting and Tracking
Before optimizing, instrument your application to track token usage per request. This gives you the baseline data you need to identify optimization targets and measure the impact of changes.
import tiktoken
from dataclasses import dataclass, field
from typing import Optional
import time
import json
@dataclass
class TokenUsage:
"""Track token usage for a single LLM call."""
prompt_tokens: int
completion_tokens: int
model: str
endpoint: str
timestamp: float = field(default_factory=time.time)
cache_hit: bool = False
estimated_cost_usd: float = 0.0
# Pricing per 1M tokens (input / output) -- update as prices change
MODEL_PRICING = {
"gpt-4o": {"input": 2.50, "output": 10.00},
"gpt-4o-mini": {"input": 0.15, "output": 0.60},
"claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
"claude-haiku-3-5": {"input": 0.80, "output": 4.00},
}
def estimate_cost(usage: TokenUsage) -> float:
"""Estimate cost in USD for a single LLM call."""
pricing = MODEL_PRICING.get(usage.model)
if not pricing:
return 0.0
input_cost = (usage.prompt_tokens / 1_000_000) * pricing["input"]
output_cost = (usage.completion_tokens / 1_000_000) * pricing["output"]
return round(input_cost + output_cost, 6)
def count_tokens(text: str, model: str = "gpt-4o") -> int:
"""Count tokens for a given text using tiktoken."""
try:
encoding = tiktoken.encoding_for_model(model)
except KeyError:
encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

Prompt Optimization Techniques
Prompt optimization is the lowest-effort, highest-impact cost reduction strategy for LLM-heavy applications. Most system prompts are written during development when token costs are not a concern and then never revisited. Common patterns that waste tokens include verbose role definitions, redundant instructions, overly detailed few-shot examples, and including context that the model does not need for the specific task.
def optimize_system_prompt(prompt: str) -> dict:
"""Analyze a system prompt and suggest optimizations.
Returns a dict with the original token count,
specific recommendations, and estimated savings.
"""
token_count = count_tokens(prompt)
recommendations = []
# Check for common waste patterns
lines = prompt.split("\n")
# 1. Redundant instructions
seen_instructions = set()
for i, line in enumerate(lines):
normalized = line.strip().lower()
if normalized in seen_instructions and len(normalized) > 20:
recommendations.append(
f"Line {i+1}: Duplicate instruction detected"
)
seen_instructions.add(normalized)
# 2. Verbose phrasing
verbose_patterns = {
"I want you to act as": "You are",
"Please make sure to": "",
"It is important that you": "",
"You should always remember to": "",
"Under no circumstances should you ever": "Never",
}
for verbose, concise in verbose_patterns.items():
if verbose.lower() in prompt.lower():
replacement = f"Replace with '{concise}'" if concise else "Remove"
recommendations.append(
f"Verbose phrasing: '{verbose}' -> {replacement}"
)
# 3. Few-shot example length
if prompt.count("Example:") > 3 or prompt.count("###") > 6:
recommendations.append(
"Consider reducing few-shot examples to 2-3 "
"representative cases instead of exhaustive coverage"
)
return {
"original_tokens": token_count,
"recommendations": recommendations,
"estimated_savings_pct": min(len(recommendations) * 5, 40),
    }

Run a prompt audit across your entire application. List every system prompt, measure its token count, and rank them by (token count * daily call volume). The top three entries on that list are your highest-value optimization targets.
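To make the audit concrete, here is a minimal sketch of the ranking step. The inventory entries below are hypothetical; in practice the token counts come from `count_tokens()` and the call volumes from your request logs.

```python
def rank_prompts_by_spend(prompts: list[dict]) -> list[dict]:
    """Rank prompts by daily token load (tokens * daily calls)."""
    for p in prompts:
        p["daily_token_load"] = p["tokens"] * p["daily_calls"]
    return sorted(prompts, key=lambda p: p["daily_token_load"], reverse=True)

# Hypothetical inventory -- substitute measured token counts
# and real call volumes from production logs.
inventory = [
    {"name": "support_chat", "tokens": 1800, "daily_calls": 50_000},
    {"name": "summarizer", "tokens": 3200, "daily_calls": 2_000},
    {"name": "classifier", "tokens": 400, "daily_calls": 250_000},
]

ranked = rank_prompts_by_spend(inventory)
```

Note how the short classifier prompt tops this hypothetical list: volume, not prompt length, drives its daily token load, which is why ranking by the product matters.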
Context Window Management
For applications with conversation history or RAG pipelines, context window management is critical. Naive approaches send the entire conversation history or all retrieved documents in every request, which inflates costs linearly with conversation length. Effective context management requires strategies for summarizing history, selecting relevant context, and truncating gracefully.
from typing import List, Dict
def manage_conversation_context(
messages: List[Dict[str, str]],
max_context_tokens: int = 4000,
model: str = "gpt-4o",
) -> List[Dict[str, str]]:
    """Trim conversation history to fit within a token budget.

    Strategy: keep the system prompt and as many of the most
    recent messages as fit the budget; older messages are
    dropped. (A production version might summarize dropped
    messages instead of discarding them.)
    """
if not messages:
return messages
system_msgs = [m for m in messages if m["role"] == "system"]
non_system = [m for m in messages if m["role"] != "system"]
# Always keep system prompt
    system_tokens = sum(count_tokens(m["content"], model) for m in system_msgs)
remaining_budget = max_context_tokens - system_tokens
# Work backwards from most recent, keeping messages
# until we exhaust the budget
kept_messages = []
used_tokens = 0
for msg in reversed(non_system):
        msg_tokens = count_tokens(msg["content"], model)
if used_tokens + msg_tokens > remaining_budget:
break
kept_messages.insert(0, msg)
used_tokens += msg_tokens
return system_msgs + kept_messages
def select_rag_context(
query: str,
retrieved_chunks: List[Dict],
max_context_tokens: int = 3000,
) -> List[Dict]:
"""Select the most relevant RAG chunks within a token budget.
Chunks should already be sorted by relevance score.
We greedily add chunks until the budget is exhausted.
"""
selected = []
used_tokens = 0
for chunk in retrieved_chunks:
chunk_tokens = count_tokens(chunk["text"])
if used_tokens + chunk_tokens > max_context_tokens:
continue # Skip this chunk, try smaller ones
selected.append(chunk)
used_tokens += chunk_tokens
    return selected

Model Selection & Routing
Not every request needs your most capable (and most expensive) model. A significant portion of production traffic involves straightforward tasks -- classification, extraction, formatting, simple Q&A -- that a smaller, cheaper model handles equally well. Intelligent model routing matches task complexity to model capability, sending only the requests that genuinely benefit from frontier-level reasoning to your most expensive model.
| Model Tier | Example Models | Best For | Relative Cost | Latency Profile |
|---|---|---|---|---|
| Frontier | GPT-4o, Claude Sonnet 4 | Complex reasoning, nuanced generation, multi-step analysis, code generation with architectural decisions | High (baseline) | 1-5s typical, up to 30s for long outputs |
| Mid-Tier | GPT-4o-mini, Claude Haiku 3.5 | Straightforward generation, summarization, translation, simple code tasks, structured extraction | 5-20x cheaper than frontier | 200ms-2s typical |
| Lightweight / Open | Llama 3, Mistral, Phi-3 | Classification, entity extraction, simple formatting, high-volume low-complexity tasks | 10-50x cheaper (self-hosted) or free (local) | 50-500ms (self-hosted GPU) |
| Specialized Fine-Tuned | Custom fine-tunes on smaller base models | Domain-specific tasks where a fine-tuned small model matches or exceeds a general frontier model | Variable -- high upfront training cost, low per-inference cost | Depends on base model size |
The most practical routing strategy for most teams is a two-tier approach: route simple tasks to a mid-tier model and complex tasks to a frontier model. Complexity classification can be rule-based (task type, input length, domain) or model-based (use a cheap classifier to estimate task complexity). Start with rules and graduate to model-based routing as you accumulate data.
from enum import Enum
from typing import Optional
class TaskComplexity(Enum):
SIMPLE = "simple"
MODERATE = "moderate"
COMPLEX = "complex"
# Map complexity to model -- adjust based on your quality requirements
ROUTING_TABLE = {
TaskComplexity.SIMPLE: "gpt-4o-mini",
TaskComplexity.MODERATE: "gpt-4o-mini",
TaskComplexity.COMPLEX: "gpt-4o",
}
def classify_task_complexity(
task_type: str,
input_tokens: int,
requires_reasoning: bool = False,
requires_code_generation: bool = False,
) -> TaskComplexity:
    """Rule-based task complexity classifier.

    Start with rules, then replace with a trained classifier
    once you have labeled data from production traffic.
    (input_tokens is reserved for future length-based rules
    and is not used by the current heuristics.)
    """
# Tasks that always need frontier models
if requires_reasoning or requires_code_generation:
return TaskComplexity.COMPLEX
# Classification, extraction, and formatting are
# typically simple regardless of input length
simple_tasks = {
"classification", "extraction", "formatting",
"translation", "sentiment", "summarize_short",
}
if task_type in simple_tasks:
return TaskComplexity.SIMPLE
# Long-form generation and analysis need more capability
complex_tasks = {
"analysis", "long_generation", "multi_step",
"code_review", "architecture",
}
if task_type in complex_tasks:
return TaskComplexity.COMPLEX
# Default to moderate for unknown task types
return TaskComplexity.MODERATE
def route_request(
task_type: str,
input_text: str,
**kwargs,
) -> str:
"""Select the appropriate model for a given request."""
input_tokens = count_tokens(input_text)
complexity = classify_task_complexity(
task_type, input_tokens, **kwargs
)
    return ROUTING_TABLE[complexity]

Track quality metrics per model tier in production. If your mid-tier model handles a task type with equivalent quality to your frontier model, you have found a permanent routing optimization. If quality drops noticeably, you know where the boundary is. This data is invaluable for future routing decisions.
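One lightweight way to collect that data is a per-tier score tracker. This is a sketch, not a full evaluation pipeline: the model names, example scores, and the 0.02 tolerance are all illustrative assumptions.

```python
from collections import defaultdict

class TierQualityTracker:
    """Track a quality score (e.g. eval pass rate or thumbs-up
    rate) per (task_type, model) pair so down-routing decisions
    are grounded in production data."""

    def __init__(self) -> None:
        self.scores: dict[tuple, list] = defaultdict(list)

    def record(self, task_type: str, model: str, score: float) -> None:
        self.scores[(task_type, model)].append(score)

    def mean_score(self, task_type: str, model: str) -> float:
        vals = self.scores.get((task_type, model), [])
        return sum(vals) / len(vals) if vals else 0.0

    def safe_to_downroute(
        self, task_type: str, cheap_model: str,
        frontier_model: str, tolerance: float = 0.02,
    ) -> bool:
        """True if the cheap model scores within `tolerance`
        of the frontier model for this task type."""
        return (self.mean_score(task_type, cheap_model)
                >= self.mean_score(task_type, frontier_model) - tolerance)

# Illustrative scores -- in practice these come from evals
# or user feedback on live traffic.
tracker = TierQualityTracker()
tracker.record("extraction", "gpt-4o", 0.97)
tracker.record("extraction", "gpt-4o-mini", 0.96)
tracker.record("analysis", "gpt-4o", 0.95)
tracker.record("analysis", "gpt-4o-mini", 0.80)
```

With these illustrative numbers, extraction is safe to route to the mid-tier model while analysis is not, which is exactly the boundary the routing table should encode.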
Caching Strategies
Caching is the second-highest-impact optimization lever after model routing. Unlike traditional web caching where exact URL matching is sufficient, AI workload caching requires semantic matching -- recognizing that 'What is the capital of France?' and 'Tell me France's capital city' should return the same cached response. This section covers three caching strategies.
Semantic Caching
Semantic caching uses embedding similarity to match incoming queries against previously cached responses. When a new query's embedding is sufficiently similar to a cached query's embedding, the cached response is returned without making an LLM call. The critical parameter is the similarity threshold: too high and you get few cache hits; too low and you return irrelevant cached responses.
import hashlib
import numpy as np
from typing import Optional, Tuple
import time
class SemanticCache:
"""LLM response cache using embedding similarity.
Uses cosine similarity to match semantically equivalent
queries, avoiding redundant LLM calls.
"""
def __init__(
self,
similarity_threshold: float = 0.95,
max_entries: int = 10_000,
ttl_seconds: int = 3600,
):
self.threshold = similarity_threshold
self.max_entries = max_entries
self.ttl = ttl_seconds
self.cache: dict[str, dict] = {}
self.embeddings: list[Tuple[str, np.ndarray]] = []
self.stats = {"hits": 0, "misses": 0, "evictions": 0}
def _cosine_similarity(
self, a: np.ndarray, b: np.ndarray
) -> float:
return float(
np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
)
def get(
self, query_embedding: np.ndarray
) -> Optional[str]:
"""Look up a semantically similar cached response."""
best_score = 0.0
best_key = None
for key, emb in self.embeddings:
# Skip expired entries
entry = self.cache.get(key)
if not entry:
continue
if time.time() - entry["timestamp"] > self.ttl:
continue
score = self._cosine_similarity(query_embedding, emb)
if score > best_score:
best_score = score
best_key = key
if best_score >= self.threshold and best_key:
self.stats["hits"] += 1
return self.cache[best_key]["response"]
self.stats["misses"] += 1
return None
def put(
self,
query: str,
query_embedding: np.ndarray,
response: str,
) -> None:
"""Store a query-response pair in the cache."""
key = hashlib.sha256(query.encode()).hexdigest()
# Evict oldest if at capacity
if len(self.cache) >= self.max_entries:
oldest_key = min(
self.cache, key=lambda k: self.cache[k]["timestamp"]
)
del self.cache[oldest_key]
self.embeddings = [
(k, e) for k, e in self.embeddings if k != oldest_key
]
self.stats["evictions"] += 1
self.cache[key] = {
"query": query,
"response": response,
"timestamp": time.time(),
}
self.embeddings.append((key, query_embedding))
@property
def hit_rate(self) -> float:
total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total > 0 else 0.0

Prompt Caching (Provider-Level)
Several LLM providers offer built-in prompt caching that reduces costs when the same system prompt prefix is reused across requests. Anthropic's prompt caching, for example, caches the system prompt and initial messages, charging reduced rates for cached tokens on subsequent requests. This is particularly effective for applications with long, stable system prompts.
# Anthropic prompt caching: mark cacheable content with
# cache_control to avoid re-processing stable prefixes.
#
# The system prompt and few-shot examples are cached on first
# request. Subsequent requests with the same prefix pay only
# the cache read cost (typically 90% cheaper than processing).
def build_cached_request(user_query: str) -> dict:
"""Build an API request that leverages prompt caching.
The system prompt and examples are stable across requests,
so they benefit from caching. Only the user query changes.
"""
return {
"model": "claude-sonnet-4-20250514",
"max_tokens": 1024,
"system": [
{
"type": "text",
"text": LONG_SYSTEM_PROMPT, # 2000+ tokens
"cache_control": {"type": "ephemeral"},
}
],
"messages": [
{"role": "user", "content": user_query},
],
}
# Cost comparison for a 3000-token system prompt:
# Without caching: 3000 input tokens charged at full rate
# With caching (first request): 3000 tokens + small cache write fee
# With caching (subsequent): ~300 token equivalent (90% savings)
# Break-even: typically after 2-3 requests with the same prefix

Embedding Cache
Embedding generation is often overlooked as a cost center, but for RAG-heavy applications it can be significant. If you embed the same documents or chunks repeatedly (e.g., on every deployment or when rebuilding an index), caching embeddings avoids redundant API calls. A simple content-hash-to-embedding map stored in Redis or a local database is sufficient.
import hashlib
import json
from typing import Optional, List
import redis
class EmbeddingCache:
"""Cache embeddings by content hash to avoid redundant
embedding API calls.
Especially valuable when:
- Rebuilding vector indices after code changes
- Processing documents that overlap across pipelines
- Running evaluation suites with fixed test data
"""
def __init__(self, redis_url: str = "redis://localhost:6379"):
self.redis = redis.from_url(redis_url)
self.prefix = "emb_cache:"
def _content_hash(self, text: str, model: str) -> str:
"""Hash content + model to create cache key."""
key_input = f"{model}:{text}"
return hashlib.sha256(key_input.encode()).hexdigest()
def get(
self, text: str, model: str
) -> Optional[List[float]]:
"""Retrieve cached embedding if available."""
key = self.prefix + self._content_hash(text, model)
cached = self.redis.get(key)
if cached:
return json.loads(cached)
return None
def put(
self, text: str, model: str, embedding: List[float],
ttl_seconds: int = 86400 * 30, # 30 days default
) -> None:
"""Cache an embedding with expiration."""
key = self.prefix + self._content_hash(text, model)
self.redis.setex(key, ttl_seconds, json.dumps(embedding))
def get_or_compute(
self,
text: str,
model: str,
compute_fn,
) -> List[float]:
"""Get from cache or compute and cache."""
cached = self.get(text, model)
if cached is not None:
return cached
embedding = compute_fn(text, model)
self.put(text, model, embedding)
        return embedding

Track your cache hit rates per cache type and tune thresholds based on actual data. A well-tuned semantic cache typically achieves hit rates between 20% and 50% depending on query diversity. If your hit rate is below 10%, your similarity threshold may be too strict. If quality complaints correlate with cache hits, it may be too loose.
Cost Calculator
Estimate your monthly AI infrastructure costs across different scale tiers, and revisit the estimate as your current or projected usage changes. The components reflect typical cost centers for an AI application using a mix of API-based and self-hosted models.
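A rough estimate can be computed directly. Every number below -- the tier traffic profiles, token prices, GPU rate, and storage rate -- is an illustrative assumption to replace with your own billing data.

```python
# Illustrative scale tiers -- substitute your own traffic,
# GPU, and storage figures.
TIERS = {
    "startup": {"llm_calls_per_day": 10_000, "avg_prompt_tokens": 800,
                "avg_completion_tokens": 300, "gpu_hours": 0, "storage_gb": 50},
    "growth": {"llm_calls_per_day": 200_000, "avg_prompt_tokens": 800,
               "avg_completion_tokens": 300, "gpu_hours": 300, "storage_gb": 500},
    "scale": {"llm_calls_per_day": 2_000_000, "avg_prompt_tokens": 800,
              "avg_completion_tokens": 300, "gpu_hours": 2_000, "storage_gb": 5_000},
}

def estimate_monthly_cost(
    tier: str,
    input_price_per_m: float = 2.50,    # USD per 1M input tokens
    output_price_per_m: float = 10.00,  # USD per 1M output tokens
    gpu_hourly_usd: float = 1.20,       # blended GPU instance rate
    storage_gb_month_usd: float = 0.25,
) -> float:
    """Rough monthly cost estimate for a usage tier."""
    t = TIERS[tier]
    calls = t["llm_calls_per_day"] * 30
    input_cost = calls * t["avg_prompt_tokens"] / 1e6 * input_price_per_m
    output_cost = calls * t["avg_completion_tokens"] / 1e6 * output_price_per_m
    gpu_cost = t["gpu_hours"] * gpu_hourly_usd
    storage_cost = t["storage_gb"] * storage_gb_month_usd
    return round(input_cost + output_cost + gpu_cost + storage_cost, 2)
```

Under these assumptions the startup tier comes to roughly $1,500/month, dominated by token costs -- consistent with the LLM API cost share described earlier.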
Batch vs Real-Time Processing
One of the most overlooked cost optimization strategies is identifying which AI workloads actually need real-time inference and which can be processed in batches. Batch processing is significantly cheaper for three reasons: you can use lower-priority compute, you can take advantage of batch API pricing (typically 50% discount), and you can optimize throughput by batching similar requests together. Many features that seem to require real-time processing actually tolerate latency measured in seconds or even minutes.
Before
All LLM requests processed synchronously at real-time pricing. Every user action triggers an immediate API call at full per-token rates. Peak traffic drives GPU provisioning, leaving expensive instances idle during off-peak hours.
After
Requests classified as real-time or deferrable. Deferrable tasks queued and processed in batches at discounted rates. GPU instances scaled to average load, with batch jobs consuming spare capacity. Result: same output quality at significantly lower cost.
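The batch discount can be translated into a quick savings estimate. The 50% default reflects the typical batch API discount mentioned above; the spend and deferrable-fraction figures in the usage example are illustrative, so check your provider's actual pricing and your own traffic mix.

```python
def batch_savings(
    monthly_llm_spend_usd: float,
    deferrable_fraction: float,
    batch_discount: float = 0.50,  # typical batch API discount; verify with your provider
) -> dict:
    """Estimate monthly savings from moving deferrable traffic
    to a batch API."""
    deferrable = monthly_llm_spend_usd * deferrable_fraction
    savings = deferrable * batch_discount
    return {
        "deferrable_spend": round(deferrable, 2),
        "estimated_savings": round(savings, 2),
        "new_monthly_cost": round(monthly_llm_spend_usd - savings, 2),
    }

# Illustrative: $20k/month LLM spend, 40% of traffic deferrable
result = batch_savings(20_000, deferrable_fraction=0.4)
```

With those assumed inputs, moving the deferrable 40% to batch pricing saves about $4,000/month before accounting for any GPU right-sizing gains.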
import asyncio
from dataclasses import dataclass
from typing import List, Callable, Any
from collections import deque
import time
@dataclass
class BatchItem:
"""A single item in the processing queue."""
payload: dict
callback: Callable
enqueued_at: float = 0.0
class BatchProcessor:
"""Collect requests and process them in batches.
Flushes when batch is full or max_wait_seconds elapses,
whichever comes first. This lets you use batch API pricing
while keeping latency bounded.
"""
def __init__(
self,
process_batch_fn: Callable,
batch_size: int = 20,
max_wait_seconds: float = 5.0,
):
self.process_fn = process_batch_fn
self.batch_size = batch_size
self.max_wait = max_wait_seconds
self.queue: deque[BatchItem] = deque()
self.stats = {
"batches_processed": 0,
"items_processed": 0,
"avg_batch_size": 0.0,
}
async def enqueue(self, payload: dict) -> Any:
"""Add a request to the batch queue.
Returns a future that resolves when the batch
containing this item is processed.
"""
        future = asyncio.get_running_loop().create_future()
item = BatchItem(
payload=payload,
callback=lambda result: future.set_result(result),
enqueued_at=time.time(),
)
self.queue.append(item)
# Flush if batch is full
if len(self.queue) >= self.batch_size:
await self._flush()
        return await future

    async def run_flush_loop(self) -> None:
        """Flush partial batches every max_wait_seconds so no
        item waits indefinitely when traffic is slow. Run as a
        background task:
        asyncio.create_task(processor.run_flush_loop())
        """
        while True:
            await asyncio.sleep(self.max_wait)
            await self._flush()
async def _flush(self) -> None:
"""Process all queued items as a single batch."""
if not self.queue:
return
batch = []
while self.queue and len(batch) < self.batch_size:
batch.append(self.queue.popleft())
payloads = [item.payload for item in batch]
results = await self.process_fn(payloads)
for item, result in zip(batch, results):
item.callback(result)
# Update stats
self.stats["batches_processed"] += 1
self.stats["items_processed"] += len(batch)
total = self.stats["items_processed"]
batches = self.stats["batches_processed"]
        self.stats["avg_batch_size"] = total / batches

Infrastructure Optimization
Infrastructure costs for AI workloads are dominated by GPU instances. Unlike CPU-bound workloads where instance selection is relatively forgiving, GPU instance selection has a dramatic impact on both cost and performance. Choosing the wrong GPU type for your workload can result in paying for capabilities you do not use (e.g., provisioning A100s for inference workloads that fit on T4s) or underperforming because the GPU memory is insufficient for your model.
GPU Instance Selection
Match GPU capability to workload requirements. The key variables are GPU memory (determines maximum model size), compute throughput (determines inference speed), and interconnect bandwidth (matters for distributed training, not for single-model inference). For most inference workloads, GPU memory is the binding constraint. For training workloads, compute throughput and interconnect speed dominate.
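A back-of-the-envelope memory check helps with that selection. In this sketch, the 1.2x overhead factor is a rough rule of thumb (KV cache and runtime overhead vary widely with context length and batch size), and the GPU memory figures are approximate card specs used as a cost proxy.

```python
# Approximate GPU memory capacities (GB), memory as a cost proxy
GPU_MEMORY_GB = {"T4": 16, "L4": 24, "A10G": 24, "A100-40": 40, "A100-80": 80}

def inference_memory_gb(
    params_billions: float,
    bytes_per_param: float = 2.0,  # 2.0 fp16/bf16, 1.0 int8, 0.5 4-bit
    overhead_factor: float = 1.2,  # rough headroom for KV cache and runtime
) -> float:
    """Back-of-the-envelope GPU memory to serve a model.
    Long contexts and large batches need much more KV-cache room."""
    return params_billions * bytes_per_param * overhead_factor

def cheapest_fitting_gpu(params_billions: float, bytes_per_param: float = 2.0):
    """Smallest GPU that fits the model, or None if the model
    needs multi-GPU serving."""
    need = inference_memory_gb(params_billions, bytes_per_param)
    for name, mem in sorted(GPU_MEMORY_GB.items(), key=lambda kv: kv[1]):
        if mem >= need:
            return name
    return None
```

Under these assumptions a 7B model in fp16 lands on a 24 GB card, the same model quantized to int8 fits a T4, and a 70B model in fp16 returns None, signaling multi-GPU serving or aggressive quantization.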
Spot and Preemptible Instances
Spot instances (AWS) and preemptible VMs (GCP) offer GPU compute at significant discounts in exchange for the risk of interruption. This trade-off is excellent for batch workloads, training jobs with checkpointing, and non-latency-sensitive inference. It is a poor fit for production inference endpoints that require consistent availability. The pattern that works for most teams is spot instances for training and development, with on-demand or reserved instances for production inference.
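A simple model of interruption overhead shows why the trade-off favors checkpointed batch jobs. The discount and interruption figures in the usage example are illustrative, not provider quotes; measure your own interruption rates per instance type and region.

```python
def effective_spot_cost(
    on_demand_hourly: float,
    spot_discount: float,
    interruptions_per_hour: float,
    hours_lost_per_interruption: float,
) -> float:
    """Effective cost per useful hour on spot capacity for a
    checkpointed job. Each interruption costs restart time plus
    recomputation back to the last checkpoint."""
    spot_hourly = on_demand_hourly * (1 - spot_discount)
    # Expected extra wall-clock spent per useful hour of progress
    overhead = interruptions_per_hour * hours_lost_per_interruption
    return spot_hourly * (1 + overhead)

# Illustrative: $4.00/hr on-demand, 65% spot discount, one
# interruption every 20 hours, 30 minutes lost per interruption
rate = effective_spot_cost(4.0, 0.65, 0.05, 0.5)
```

With those assumed inputs the effective rate is about $1.44/hr versus $4.00/hr on-demand: interruption overhead barely dents the discount as long as checkpointing keeps the lost work small.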
Auto-Scaling Patterns
AI inference workloads typically have bursty traffic patterns -- long periods of moderate load punctuated by spikes. Auto-scaling is essential but requires careful configuration. Scale-up must be fast enough to handle traffic bursts without excessive latency. Scale-down must be aggressive enough to avoid paying for idle GPUs but not so aggressive that you thrash between scaling states.
# Example auto-scaling configuration for an inference endpoint.
# These values are starting points -- tune based on your
# traffic patterns and latency requirements.
SCALING_CONFIG = {
"min_replicas": 1, # Never scale to zero for prod
"max_replicas": 10, # Budget cap
"target_gpu_utilization": 0.7, # Scale up above 70%
# Scale-up: react quickly to traffic spikes
"scale_up_cooldown_seconds": 60,
"scale_up_threshold_duration_seconds": 30,
# Scale-down: be more conservative to avoid thrashing
"scale_down_cooldown_seconds": 300,
"scale_down_threshold_duration_seconds": 120,
# Queue-based scaling: scale on pending requests,
# not just GPU utilization
"queue_length_per_replica": 5,
"max_queue_wait_seconds": 10,
}
def calculate_required_replicas(
current_rps: float,
avg_inference_time_ms: float,
target_utilization: float = 0.7,
) -> int:
"""Estimate required replicas for a target utilization.
Args:
current_rps: current requests per second
avg_inference_time_ms: average inference latency
target_utilization: target GPU utilization (0-1)
"""
# Each replica can handle (1000 / avg_inference_time_ms) RPS
# at 100% utilization
capacity_per_replica = 1000 / avg_inference_time_ms
effective_capacity = capacity_per_replica * target_utilization
import math
replicas = math.ceil(current_rps / effective_capacity)
return max(
SCALING_CONFIG["min_replicas"],
min(replicas, SCALING_CONFIG["max_replicas"]),
    )

Cost Monitoring & Alerting
Cost optimization is not a one-time project -- it requires continuous monitoring. Without cost observability, optimizations erode over time as new features ship with unoptimized configurations, traffic patterns change, and pricing adjustments take effect. Build a cost monitoring layer that tracks spending by model, by feature, and by environment, with alerts that fire before costs exceed thresholds.
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from collections import defaultdict
import time
@dataclass
class CostAlert:
"""A triggered cost alert."""
alert_type: str
message: str
current_value: float
threshold: float
timestamp: float
class CostMonitor:
"""Track and alert on AI spending by feature and model.
Designed for integration with your observability stack.
Emit metrics to Prometheus/Datadog/CloudWatch as needed.
"""
def __init__(self):
self.hourly_costs: Dict[str, float] = defaultdict(float)
self.daily_costs: Dict[str, float] = defaultdict(float)
self.feature_costs: Dict[str, float] = defaultdict(float)
self.alerts: List[CostAlert] = []
# Configure thresholds per feature
self.hourly_thresholds: Dict[str, float] = {}
self.daily_thresholds: Dict[str, float] = {}
def record_cost(
self,
cost_usd: float,
model: str,
feature: str,
) -> Optional[CostAlert]:
"""Record a cost event and check thresholds."""
hour_key = f"{feature}:{int(time.time() // 3600)}"
day_key = f"{feature}:{int(time.time() // 86400)}"
self.hourly_costs[hour_key] += cost_usd
self.daily_costs[day_key] += cost_usd
self.feature_costs[feature] += cost_usd
# Check hourly threshold
hourly_threshold = self.hourly_thresholds.get(feature)
if hourly_threshold:
current_hourly = self.hourly_costs[hour_key]
if current_hourly > hourly_threshold:
alert = CostAlert(
alert_type="hourly_budget_exceeded",
message=(
f"Feature '{feature}' spent "
f"${current_hourly:.2f} this hour "
f"(threshold: ${hourly_threshold:.2f})"
),
current_value=current_hourly,
threshold=hourly_threshold,
timestamp=time.time(),
)
self.alerts.append(alert)
return alert
        # Check daily threshold (same pattern as hourly)
        daily_threshold = self.daily_thresholds.get(feature)
        if daily_threshold:
            current_daily = self.daily_costs[day_key]
            if current_daily > daily_threshold:
                alert = CostAlert(
                    alert_type="daily_budget_exceeded",
                    message=(
                        f"Feature '{feature}' spent "
                        f"${current_daily:.2f} today "
                        f"(threshold: ${daily_threshold:.2f})"
                    ),
                    current_value=current_daily,
                    threshold=daily_threshold,
                    timestamp=time.time(),
                )
                self.alerts.append(alert)
                return alert
        return None
def get_cost_summary(self) -> Dict[str, float]:
"""Get current cost summary by feature."""
return dict(self.feature_costs)
def set_threshold(
self,
feature: str,
hourly: Optional[float] = None,
daily: Optional[float] = None,
) -> None:
"""Set cost alert thresholds for a feature."""
if hourly is not None:
self.hourly_thresholds[feature] = hourly
if daily is not None:
            self.daily_thresholds[feature] = daily

Set up a weekly cost review ritual. Spend 15 minutes reviewing cost-per-feature trends, identifying any new features shipping without cost optimization, and checking cache hit rates. Small weekly reviews prevent large quarterly surprises.
Quick Wins Checklist
The following checklist covers optimizations ordered by effort-to-impact ratio. Start at the top and work your way down. Most teams can achieve meaningful cost reduction by completing just the first category.
Immediate Wins (This Week)
Short-Term (This Month)
Strategic (This Quarter)
Version History
1.0.0 · 2026-03-01
- Initial release covering token optimization, model routing, caching strategies, batch processing, infrastructure optimization, and cost monitoring
- Interactive cost calculator with three scale tiers
- Architecture diagram showing six primary cost centers
- Code examples for semantic caching, model routing, batch processing, and cost monitoring
- Quick wins checklist organized by implementation timeframe