Key Takeaway
AI cost optimization is not about spending less -- it is about spending deliberately. The highest-impact lever for most teams is matching model capability to task complexity: using your most capable model only where it matters and routing everything else to faster, cheaper alternatives.
Prerequisites
- At least one AI workload running in production with observable cost data
- Access to billing dashboards for your cloud provider and/or LLM API provider
- Basic understanding of token-based pricing for LLM APIs
- Familiarity with your application's query patterns and traffic volumes
- A cost tracking system or the ability to implement one (even a spreadsheet to start)
The AI Cost Problem
AI workloads follow a different cost curve than traditional software. Traditional SaaS applications scale costs roughly linearly with users: more users mean more compute, storage, and bandwidth, but the cost per user stays relatively stable. AI workloads break this model. Every inference call has a non-trivial marginal cost, and that cost varies dramatically based on the model, the prompt length, and the response complexity. A single feature powered by a frontier LLM can cost more per API call than your entire application server costs per request.
The challenge is compounded by the fact that AI costs are often opaque until the bill arrives. Engineering teams build features using the most capable model available during development, hard-code prompt templates that are longer than necessary, and skip caching because the traffic is low in staging. Then the feature launches, traffic scales, and the monthly bill becomes a conversation topic in the executive team meeting.
- LLM API Calls (40-60%): The largest cost driver for most AI applications. Prompt and completion tokens at frontier model prices dominate the bill.
- Compute & GPU (15-25%): GPU instances for self-hosted models, fine-tuning jobs, and embedding generation. Often over-provisioned.
- Storage & Embeddings (10-20%): Vector databases, model artifact storage, training data, and embedding indices.
- Monitoring & Tooling (5-15%): Observability platforms, experiment tracking, evaluation pipelines, and MLOps infrastructure.
Cost Anatomy: Where the Money Goes
Before you can optimize costs, you need to understand where they accumulate. The breakdown above shows the primary cost centers in a typical AI application stack. Most teams discover that one or two cost centers dominate their bill, and targeted optimization of those centers yields better results than trying to optimize everything at once.
Token Analysis & Optimization
For applications that rely on LLM API calls, token usage is the single largest cost driver. Every token in your prompt and every token in the model's response costs money. The good news is that most applications send far more tokens than necessary. Verbose system prompts, redundant context, unoptimized few-shot examples, and unbounded response lengths all contribute to inflated token counts. Optimizing token usage requires measuring it first.
Token Counting and Tracking
Before optimizing, instrument your application to track token usage per request. This gives you the baseline data you need to identify optimization targets and measure the impact of changes.
import tiktoken
from dataclasses import dataclass, field
from typing import Optional
import time
import json
@dataclass
class TokenUsage:
"""Track token usage for a single LLM call."""
prompt_tokens: int
completion_tokens: int
model: str
endpoint: str
timestamp: float = field(default_factory=time.time)
cache_hit: bool = False
estimated_cost_usd: float = 0.0
# Pricing per 1M tokens (input / output) -- update as prices change
MODEL_PRICING = {
"gpt-4o": {"input": 2.50, "output": 10.00},
"gpt-4o-mini": {"input": 0.15, "output": 0.60},
"claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
"claude-haiku-3-5": {"input": 0.80, "output": 4.00},
}
def estimate_cost(usage: TokenUsage) -> float:
"""Estimate cost in USD for a single LLM call."""
pricing = MODEL_PRICING.get(usage.model)
if not pricing:
return 0.0
input_cost = (usage.prompt_tokens / 1_000_000) * pricing["input"]
output_cost = (usage.completion_tokens / 1_000_000) * pricing["output"]
return round(input_cost + output_cost, 6)
def count_tokens(text: str, model: str = "gpt-4o") -> int:
"""Count tokens for a given text using tiktoken."""
try:
encoding = tiktoken.encoding_for_model(model)
except KeyError:
encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

Prompt Optimization Techniques
Prompt optimization is the lowest-effort, highest-impact cost reduction strategy for LLM-heavy applications. Most system prompts are written during development when token costs are not a concern and then never revisited. Common patterns that waste tokens include verbose role definitions, redundant instructions, overly detailed few-shot examples, and including context that the model does not need for the specific task.
def optimize_system_prompt(prompt: str) -> dict:
"""Analyze a system prompt and suggest optimizations.
Returns a dict with the original token count,
specific recommendations, and estimated savings.
"""
token_count = count_tokens(prompt)
recommendations = []
# Check for common waste patterns
lines = prompt.split("\n")
# 1. Redundant instructions
seen_instructions = set()
for i, line in enumerate(lines):
normalized = line.strip().lower()
if normalized in seen_instructions and len(normalized) > 20:
recommendations.append(
f"Line {i+1}: Duplicate instruction detected"
)
seen_instructions.add(normalized)
# 2. Verbose phrasing
verbose_patterns = {
"I want you to act as": "You are",
"Please make sure to": "",
"It is important that you": "",
"You should always remember to": "",
"Under no circumstances should you ever": "Never",
}
for verbose, concise in verbose_patterns.items():
if verbose.lower() in prompt.lower():
replacement = f"Replace with '{concise}'" if concise else "Remove"
recommendations.append(
f"Verbose phrasing: '{verbose}' -> {replacement}"
)
# 3. Few-shot example length
if prompt.count("Example:") > 3 or prompt.count("###") > 6:
recommendations.append(
"Consider reducing few-shot examples to 2-3 "
"representative cases instead of exhaustive coverage"
)
return {
"original_tokens": token_count,
"recommendations": recommendations,
"estimated_savings_pct": min(len(recommendations) * 5, 40),
    }

Run a prompt audit across your entire application. List every system prompt, measure its token count, and rank them by (token count * daily call volume). The top three entries on that list are your highest-value optimization targets.
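To make the audit concrete, here is a minimal sketch of the ranking step. The inventory entries below are hypothetical; in practice the token counts come from `count_tokens()` and the call volumes from your request logs.

```python
def rank_prompts_by_spend(prompts: list[dict]) -> list[dict]:
    """Rank prompts by daily token load (tokens * daily calls)."""
    for p in prompts:
        p["daily_token_load"] = p["tokens"] * p["daily_calls"]
    return sorted(prompts, key=lambda p: p["daily_token_load"], reverse=True)

# Hypothetical inventory -- substitute measured token counts
# and real call volumes from production logs.
inventory = [
    {"name": "support_chat", "tokens": 1800, "daily_calls": 50_000},
    {"name": "summarizer", "tokens": 3200, "daily_calls": 2_000},
    {"name": "classifier", "tokens": 400, "daily_calls": 250_000},
]

ranked = rank_prompts_by_spend(inventory)
```

Note how the short classifier prompt tops this hypothetical list: volume, not prompt length, drives its daily token load, which is why ranking by the product matters.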
Context Window Management
For applications with conversation history or RAG pipelines, context window management is critical. Naive approaches send the entire conversation history or all retrieved documents in every request, which inflates costs linearly with conversation length. Effective context management requires strategies for summarizing history, selecting relevant context, and truncating gracefully.
from typing import List, Dict
def manage_conversation_context(
messages: List[Dict[str, str]],
max_context_tokens: int = 4000,
model: str = "gpt-4o",
) -> List[Dict[str, str]]:
    """Trim conversation history to fit within a token budget.

    Strategy: keep the system prompt and as many of the most
    recent messages as fit the budget; older messages are
    dropped. (A production version might summarize dropped
    messages instead of discarding them.)
    """
if not messages:
return messages
system_msgs = [m for m in messages if m["role"] == "system"]
non_system = [m for m in messages if m["role"] != "system"]
# Always keep system prompt
    system_tokens = sum(count_tokens(m["content"], model) for m in system_msgs)
remaining_budget = max_context_tokens - system_tokens
# Work backwards from most recent, keeping messages
# until we exhaust the budget
kept_messages = []
used_tokens = 0
for msg in reversed(non_system):
        msg_tokens = count_tokens(msg["content"], model)
if used_tokens + msg_tokens > remaining_budget:
break
kept_messages.insert(0, msg)
used_tokens += msg_tokens
return system_msgs + kept_messages
def select_rag_context(
query: str,
retrieved_chunks: List[Dict],
max_context_tokens: int = 3000,
) -> List[Dict]:
"""Select the most relevant RAG chunks within a token budget.
Chunks should already be sorted by relevance score.
We greedily add chunks until the budget is exhausted.
"""
selected = []
used_tokens = 0
for chunk in retrieved_chunks:
chunk_tokens = count_tokens(chunk["text"])
if used_tokens + chunk_tokens > max_context_tokens:
continue # Skip this chunk, try smaller ones
selected.append(chunk)
used_tokens += chunk_tokens
    return selected

Model Selection & Routing
Not every request needs your most capable (and most expensive) model. A significant portion of production traffic involves straightforward tasks -- classification, extraction, formatting, simple Q&A -- that a smaller, cheaper model handles equally well. Intelligent model routing matches task complexity to model capability, sending only the requests that genuinely benefit from frontier-level reasoning to your most expensive model.
| Model Tier | Example Models | Best For | Relative Cost | Latency Profile |
|---|---|---|---|---|
| Frontier | GPT-4o, Claude Sonnet 4 | Complex reasoning, nuanced generation, multi-step analysis, code generation with architectural decisions | High (baseline) | 1-5s typical, up to 30s for long outputs |
| Mid-Tier | GPT-4o-mini, Claude Haiku 3.5 | Straightforward generation, summarization, translation, simple code tasks, structured extraction | 5-20x cheaper than frontier | 200ms-2s typical |
| Lightweight / Open | Llama 3, Mistral, Phi-3 | Classification, entity extraction, simple formatting, high-volume low-complexity tasks | 10-50x cheaper (self-hosted) or free (local) | 50-500ms (self-hosted GPU) |
| Specialized Fine-Tuned | Custom fine-tunes on smaller base models | Domain-specific tasks where a fine-tuned small model matches or exceeds a general frontier model | Variable -- high upfront training cost, low per-inference cost | Depends on base model size |
The most practical routing strategy for most teams is a two-tier approach: route simple tasks to a mid-tier model and complex tasks to a frontier model. Complexity classification can be rule-based (task type, input length, domain) or model-based (use a cheap classifier to estimate task complexity). Start with rules and graduate to model-based routing as you accumulate data.
from enum import Enum
from typing import Optional
class TaskComplexity(Enum):
SIMPLE = "simple"
MODERATE = "moderate"
COMPLEX = "complex"
# Map complexity to model -- adjust based on your quality requirements
ROUTING_TABLE = {
TaskComplexity.SIMPLE: "gpt-4o-mini",
TaskComplexity.MODERATE: "gpt-4o-mini",
TaskComplexity.COMPLEX: "gpt-4o",
}
def classify_task_complexity(
task_type: str,
input_tokens: int,
requires_reasoning: bool = False,
requires_code_generation: bool = False,
) -> TaskComplexity:
    """Rule-based task complexity classifier.

    Start with rules, then replace with a trained classifier
    once you have labeled data from production traffic.
    (input_tokens is reserved for future length-based rules
    and is not used by the current heuristics.)
    """
# Tasks that always need frontier models
if requires_reasoning or requires_code_generation:
return TaskComplexity.COMPLEX
# Classification, extraction, and formatting are
# typically simple regardless of input length
simple_tasks = {
"classification", "extraction", "formatting",
"translation", "sentiment", "summarize_short",
}
if task_type in simple_tasks:
return TaskComplexity.SIMPLE
# Long-form generation and analysis need more capability
complex_tasks = {
"analysis", "long_generation", "multi_step",
"code_review", "architecture",
}
if task_type in complex_tasks:
return TaskComplexity.COMPLEX
# Default to moderate for unknown task types
return TaskComplexity.MODERATE
def route_request(
task_type: str,
input_text: str,
**kwargs,
) -> str:
"""Select the appropriate model for a given request."""
input_tokens = count_tokens(input_text)
complexity = classify_task_complexity(
task_type, input_tokens, **kwargs
)
    return ROUTING_TABLE[complexity]

Track quality metrics per model tier in production. If your mid-tier model handles a task type with equivalent quality to your frontier model, you have found a permanent routing optimization. If quality drops noticeably, you know where the boundary is. This data is invaluable for future routing decisions.
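One lightweight way to collect that data is a per-tier score tracker. This is a sketch, not a full evaluation pipeline: the model names, example scores, and the 0.02 tolerance are all illustrative assumptions.

```python
from collections import defaultdict

class TierQualityTracker:
    """Track a quality score (e.g. eval pass rate or thumbs-up
    rate) per (task_type, model) pair so down-routing decisions
    are grounded in production data."""

    def __init__(self) -> None:
        self.scores: dict[tuple, list] = defaultdict(list)

    def record(self, task_type: str, model: str, score: float) -> None:
        self.scores[(task_type, model)].append(score)

    def mean_score(self, task_type: str, model: str) -> float:
        vals = self.scores.get((task_type, model), [])
        return sum(vals) / len(vals) if vals else 0.0

    def safe_to_downroute(
        self, task_type: str, cheap_model: str,
        frontier_model: str, tolerance: float = 0.02,
    ) -> bool:
        """True if the cheap model scores within `tolerance`
        of the frontier model for this task type."""
        return (self.mean_score(task_type, cheap_model)
                >= self.mean_score(task_type, frontier_model) - tolerance)

# Illustrative scores -- in practice these come from evals
# or user feedback on live traffic.
tracker = TierQualityTracker()
tracker.record("extraction", "gpt-4o", 0.97)
tracker.record("extraction", "gpt-4o-mini", 0.96)
tracker.record("analysis", "gpt-4o", 0.95)
tracker.record("analysis", "gpt-4o-mini", 0.80)
```

With these illustrative numbers, extraction is safe to route to the mid-tier model while analysis is not, which is exactly the boundary the routing table should encode.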
Caching Strategies
Caching is the second-highest-impact optimization lever after model routing. Unlike traditional web caching where exact URL matching is sufficient, AI workload caching requires semantic matching -- recognizing that 'What is the capital of France?' and 'Tell me France's capital city' should return the same cached response. This section covers three caching strategies.
Semantic Caching
Semantic caching uses embedding similarity to match incoming queries against previously cached responses. When a new query's embedding is sufficiently similar to a cached query's embedding, the cached response is returned without making an LLM call. The critical parameter is the similarity threshold: too high and you get few cache hits; too low and you return irrelevant cached responses.
import hashlib
import numpy as np
from typing import Optional, Tuple
import time
class SemanticCache:
"""LLM response cache using embedding similarity.
Uses cosine similarity to match semantically equivalent
queries, avoiding redundant LLM calls.
"""
def __init__(
self,
similarity_threshold: float = 0.95,
max_entries: int = 10_000,
ttl_seconds: int = 3600,
):
self.threshold = similarity_threshold
self.max_entries = max_entries
self.ttl = ttl_seconds
self.cache: dict[str, dict] = {}
self.embeddings: list[Tuple[str, np.ndarray]] = []
self.stats = {"hits": 0, "misses": 0, "evictions": 0}
def _cosine_similarity(
self, a: np.ndarray, b: np.ndarray
) -> float:
return float(
np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
)
def get(
self, query_embedding: np.ndarray
) -> Optional[str]:
"""Look up a semantically similar cached response."""
best_score = 0.0
best_key = None
for key, emb in self.embeddings:
# Skip expired entries
entry = self.cache.get(key)
if not entry:
continue
if time.time() - entry["timestamp"] > self.ttl:
continue
score = self._cosine_similarity(query_embedding, emb)
if score > best_score:
best_score = score
best_key = key
if best_score >= self.threshold and best_key:
self.stats["hits"] += 1
return self.cache[best_key]["response"]
self.stats["misses"] += 1
return None
def put(
self,
query: str,
query_embedding: np.ndarray,
response: str,
) -> None:
"""Store a query-response pair in the cache."""
key = hashlib.sha256(query.encode()).hexdigest()
# Evict oldest if at capacity
if len(self.cache) >= self.max_entries:
oldest_key = min(
self.cache, key=lambda k: self.cache[k]["timestamp"]
)
del self.cache[oldest_key]
self.embeddings = [
(k, e) for k, e in self.embeddings if k != oldest_key
]
self.stats["evictions"] += 1
self.cache[key] = {
"query": query,
"response": response,
"timestamp": time.time(),
}
self.embeddings.append((key, query_embedding))
@property
def hit_rate(self) -> float:
total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total > 0 else 0.0

Prompt Caching (Provider-Level)
Several LLM providers offer built-in prompt caching that reduces costs when the same system prompt prefix is reused across requests. Anthropic's prompt caching, for example, caches the system prompt and initial messages, charging reduced rates for cached tokens on subsequent requests. This is particularly effective for applications with long, stable system prompts.
# Anthropic prompt caching: mark cacheable content with
# cache_control to avoid re-processing stable prefixes.
#
# The system prompt and few-shot examples are cached on first
# request. Subsequent requests with the same prefix pay only
# the cache read cost (typically 90% cheaper than processing).
def build_cached_request(user_query: str) -> dict:
"""Build an API request that leverages prompt caching.
The system prompt and examples are stable across requests,
so they benefit from caching. Only the user query changes.
"""
return {
"model": "claude-sonnet-4-20250514",
"max_tokens": 1024,
"system": [
{
"type": "text",
"text": LONG_SYSTEM_PROMPT, # 2000+ tokens
"cache_control": {"type": "ephemeral"},
}
],
"messages": [
{"role": "user", "content": user_query},
],
}
# Cost comparison for a 3000-token system prompt:
# Without caching: 3000 input tokens charged at full rate
# With caching (first request): 3000 tokens + small cache write fee
# With caching (subsequent): ~300 token equivalent (90% savings)
# Break-even: typically after 2-3 requests with the same prefix

Embedding Cache
Embedding generation is often overlooked as a cost center, but for RAG-heavy applications it can be significant. If you embed the same documents or chunks repeatedly (e.g., on every deployment or when rebuilding an index), caching embeddings avoids redundant API calls. A simple content-hash-to-embedding map stored in Redis or a local database is sufficient.
import hashlib
import json
from typing import Optional, List
import redis
class EmbeddingCache:
"""Cache embeddings by content hash to avoid redundant
embedding API calls.
Especially valuable when:
- Rebuilding vector indices after code changes
- Processing documents that overlap across pipelines
- Running evaluation suites with fixed test data
"""
def __init__(self, redis_url: str = "redis://localhost:6379"):
self.redis = redis.from_url(redis_url)
self.prefix = "emb_cache:"
def _content_hash(self, text: str, model: str) -> str:
"""Hash content + model to create cache key."""
key_input = f"{model}:{text}"
return hashlib.sha256(key_input.encode()).hexdigest()
def get(
self, text: str, model: str
) -> Optional[List[float]]:
"""Retrieve cached embedding if available."""
key = self.prefix + self._content_hash(text, model)
cached = self.redis.get(key)
if cached:
return json.loads(cached)
return None
def put(
self, text: str, model: str, embedding: List[float],
ttl_seconds: int = 86400 * 30, # 30 days default
) -> None:
"""Cache an embedding with expiration."""
key = self.prefix + self._content_hash(text, model)
self.redis.setex(key, ttl_seconds, json.dumps(embedding))
def get_or_compute(
self,
text: str,
model: str,
compute_fn,
) -> List[float]:
"""Get from cache or compute and cache."""
cached = self.get(text, model)
if cached is not None:
return cached
embedding = compute_fn(text, model)
self.put(text, model, embedding)
        return embedding

Track your cache hit rates per cache type and tune thresholds based on actual data. A well-tuned semantic cache typically achieves hit rates between 20% and 50% depending on query diversity. If your hit rate is below 10%, your similarity threshold may be too strict. If quality complaints correlate with cache hits, it may be too loose.
Cost Calculator
Estimate your monthly AI infrastructure costs across different scale tiers, and revisit the estimate as your current or projected usage changes. The components reflect typical cost centers for an AI application using a mix of API-based and self-hosted models.
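A rough estimate can be computed directly. Every number below -- the tier traffic profiles, token prices, GPU rate, and storage rate -- is an illustrative assumption to replace with your own billing data.

```python
# Illustrative scale tiers -- substitute your own traffic,
# GPU, and storage figures.
TIERS = {
    "startup": {"llm_calls_per_day": 10_000, "avg_prompt_tokens": 800,
                "avg_completion_tokens": 300, "gpu_hours": 0, "storage_gb": 50},
    "growth": {"llm_calls_per_day": 200_000, "avg_prompt_tokens": 800,
               "avg_completion_tokens": 300, "gpu_hours": 300, "storage_gb": 500},
    "scale": {"llm_calls_per_day": 2_000_000, "avg_prompt_tokens": 800,
              "avg_completion_tokens": 300, "gpu_hours": 2_000, "storage_gb": 5_000},
}

def estimate_monthly_cost(
    tier: str,
    input_price_per_m: float = 2.50,    # USD per 1M input tokens
    output_price_per_m: float = 10.00,  # USD per 1M output tokens
    gpu_hourly_usd: float = 1.20,       # blended GPU instance rate
    storage_gb_month_usd: float = 0.25,
) -> float:
    """Rough monthly cost estimate for a usage tier."""
    t = TIERS[tier]
    calls = t["llm_calls_per_day"] * 30
    input_cost = calls * t["avg_prompt_tokens"] / 1e6 * input_price_per_m
    output_cost = calls * t["avg_completion_tokens"] / 1e6 * output_price_per_m
    gpu_cost = t["gpu_hours"] * gpu_hourly_usd
    storage_cost = t["storage_gb"] * storage_gb_month_usd
    return round(input_cost + output_cost + gpu_cost + storage_cost, 2)
```

Under these assumptions the startup tier comes to roughly $1,500/month, dominated by token costs -- consistent with the LLM API cost share described earlier.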
Batch vs Real-Time Processing
One of the most overlooked cost optimization strategies is identifying which AI workloads actually need real-time inference and which can be processed in batches. Batch processing is significantly cheaper for three reasons: you can use lower-priority compute, you can take advantage of batch API pricing (typically 50% discount), and you can optimize throughput by batching similar requests together. Many features that seem to require real-time processing actually tolerate latency measured in seconds or even minutes.
Before
All LLM requests processed synchronously at real-time pricing. Every user action triggers an immediate API call at full per-token rates. Peak traffic drives GPU provisioning, leaving expensive instances idle during off-peak hours.
After
Requests classified as real-time or deferrable. Deferrable tasks queued and processed in batches at discounted rates. GPU instances scaled to average load, with batch jobs consuming spare capacity. Result: same output quality at significantly lower cost.
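The batch discount can be translated into a quick savings estimate. The 50% default reflects the typical batch API discount mentioned above; the spend and deferrable-fraction figures in the usage example are illustrative, so check your provider's actual pricing and your own traffic mix.

```python
def batch_savings(
    monthly_llm_spend_usd: float,
    deferrable_fraction: float,
    batch_discount: float = 0.50,  # typical batch API discount; verify with your provider
) -> dict:
    """Estimate monthly savings from moving deferrable traffic
    to a batch API."""
    deferrable = monthly_llm_spend_usd * deferrable_fraction
    savings = deferrable * batch_discount
    return {
        "deferrable_spend": round(deferrable, 2),
        "estimated_savings": round(savings, 2),
        "new_monthly_cost": round(monthly_llm_spend_usd - savings, 2),
    }

# Illustrative: $20k/month LLM spend, 40% of traffic deferrable
result = batch_savings(20_000, deferrable_fraction=0.4)
```

With those assumed inputs, moving the deferrable 40% to batch pricing saves about $4,000/month before accounting for any GPU right-sizing gains.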
import asyncio
from dataclasses import dataclass
from typing import List, Callable, Any
from collections import deque
import time
@dataclass
class BatchItem:
"""A single item in the processing queue."""
payload: dict
callback: Callable
enqueued_at: float = 0.0
class BatchProcessor:
"""Collect requests and process them in batches.
Flushes when batch is full or max_wait_seconds elapses,
whichever comes first. This lets you use batch API pricing
while keeping latency bounded.
"""
def __init__(
self,
process_batch_fn: Callable,
batch_size: int = 20,
max_wait_seconds: float = 5.0,
):
self.process_fn = process_batch_fn
self.batch_size = batch_size
self.max_wait = max_wait_seconds
self.queue: deque[BatchItem] = deque()
self.stats = {
"batches_processed": 0,
"items_processed": 0,
"avg_batch_size": 0.0,
}
async def enqueue(self, payload: dict) -> Any:
"""Add a request to the batch queue.
Returns a future that resolves when the batch
containing this item is processed.
"""
        future = asyncio.get_running_loop().create_future()
item = BatchItem(
payload=payload,
callback=lambda result: future.set_result(result),
enqueued_at=time.time(),
)
self.queue.append(item)
# Flush if batch is full
if len(self.queue) >= self.batch_size:
await self._flush()
        return await future

    async def run_flush_loop(self) -> None:
        """Flush partial batches every max_wait_seconds so no
        item waits indefinitely when traffic is slow. Run as a
        background task:
        asyncio.create_task(processor.run_flush_loop())
        """
        while True:
            await asyncio.sleep(self.max_wait)
            await self._flush()
async def _flush(self) -> None:
"""Process all queued items as a single batch."""
if not self.queue:
return
batch = []
while self.queue and len(batch) < self.batch_size:
batch.append(self.queue.popleft())
payloads = [item.payload for item in batch]
results = await self.process_fn(payloads)
for item, result in zip(batch, results):
item.callback(result)
# Update stats
self.stats["batches_processed"] += 1
self.stats["items_processed"] += len(batch)
total = self.stats["items_processed"]
batches = self.stats["batches_processed"]
        self.stats["avg_batch_size"] = total / batches

Infrastructure Optimization
Infrastructure costs for AI workloads are dominated by GPU instances. Unlike CPU-bound workloads where instance selection is relatively forgiving, GPU instance selection has a dramatic impact on both cost and performance. Choosing the wrong GPU type for your workload can result in paying for capabilities you do not use (e.g., provisioning A100s for inference workloads that fit on T4s) or underperforming because the GPU memory is insufficient for your model.
GPU Instance Selection
Match GPU capability to workload requirements. The key variables are GPU memory (determines maximum model size), compute throughput (determines inference speed), and interconnect bandwidth (matters for distributed training, not for single-model inference). For most inference workloads, GPU memory is the binding constraint. For training workloads, compute throughput and interconnect speed dominate.
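A back-of-the-envelope memory check helps with that selection. In this sketch, the 1.2x overhead factor is a rough rule of thumb (KV cache and runtime overhead vary widely with context length and batch size), and the GPU memory figures are approximate card specs used as a cost proxy.

```python
# Approximate GPU memory capacities (GB), memory as a cost proxy
GPU_MEMORY_GB = {"T4": 16, "L4": 24, "A10G": 24, "A100-40": 40, "A100-80": 80}

def inference_memory_gb(
    params_billions: float,
    bytes_per_param: float = 2.0,  # 2.0 fp16/bf16, 1.0 int8, 0.5 4-bit
    overhead_factor: float = 1.2,  # rough headroom for KV cache and runtime
) -> float:
    """Back-of-the-envelope GPU memory to serve a model.
    Long contexts and large batches need much more KV-cache room."""
    return params_billions * bytes_per_param * overhead_factor

def cheapest_fitting_gpu(params_billions: float, bytes_per_param: float = 2.0):
    """Smallest GPU that fits the model, or None if the model
    needs multi-GPU serving."""
    need = inference_memory_gb(params_billions, bytes_per_param)
    for name, mem in sorted(GPU_MEMORY_GB.items(), key=lambda kv: kv[1]):
        if mem >= need:
            return name
    return None
```

Under these assumptions a 7B model in fp16 lands on a 24 GB card, the same model quantized to int8 fits a T4, and a 70B model in fp16 returns None, signaling multi-GPU serving or aggressive quantization.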
Spot and Preemptible Instances
Spot instances (AWS) and preemptible VMs (GCP) offer GPU compute at significant discounts in exchange for the risk of interruption. This trade-off is excellent for batch workloads, training jobs with checkpointing, and non-latency-sensitive inference. It is a poor fit for production inference endpoints that require consistent availability. The pattern that works for most teams is spot instances for training and development, with on-demand or reserved instances for production inference.
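A simple model of interruption overhead shows why the trade-off favors checkpointed batch jobs. The discount and interruption figures in the usage example are illustrative, not provider quotes; measure your own interruption rates per instance type and region.

```python
def effective_spot_cost(
    on_demand_hourly: float,
    spot_discount: float,
    interruptions_per_hour: float,
    hours_lost_per_interruption: float,
) -> float:
    """Effective cost per useful hour on spot capacity for a
    checkpointed job. Each interruption costs restart time plus
    recomputation back to the last checkpoint."""
    spot_hourly = on_demand_hourly * (1 - spot_discount)
    # Expected extra wall-clock spent per useful hour of progress
    overhead = interruptions_per_hour * hours_lost_per_interruption
    return spot_hourly * (1 + overhead)

# Illustrative: $4.00/hr on-demand, 65% spot discount, one
# interruption every 20 hours, 30 minutes lost per interruption
rate = effective_spot_cost(4.0, 0.65, 0.05, 0.5)
```

With those assumed inputs the effective rate is about $1.44/hr versus $4.00/hr on-demand: interruption overhead barely dents the discount as long as checkpointing keeps the lost work small.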
Auto-Scaling Patterns
AI inference workloads typically have bursty traffic patterns -- long periods of moderate load punctuated by spikes. Auto-scaling is essential but requires careful configuration. Scale-up must be fast enough to handle traffic bursts without excessive latency. Scale-down must be aggressive enough to avoid paying for idle GPUs but not so aggressive that you thrash between scaling states.
# Example auto-scaling configuration for an inference endpoint.
# These values are starting points -- tune based on your
# traffic patterns and latency requirements.
SCALING_CONFIG = {
"min_replicas": 1, # Never scale to zero for prod
"max_replicas": 10, # Budget cap
"target_gpu_utilization": 0.7, # Scale up above 70%
# Scale-up: react quickly to traffic spikes
"scale_up_cooldown_seconds": 60,
"scale_up_threshold_duration_seconds": 30,
# Scale-down: be more conservative to avoid thrashing
"scale_down_cooldown_seconds": 300,
"scale_down_threshold_duration_seconds": 120,
# Queue-based scaling: scale on pending requests,
# not just GPU utilization
"queue_length_per_replica": 5,
"max_queue_wait_seconds": 10,
}
def calculate_required_replicas(
current_rps: float,
avg_inference_time_ms: float,
target_utilization: float = 0.7,
) -> int:
"""Estimate required replicas for a target utilization.
Args:
current_rps: current requests per second
avg_inference_time_ms: average inference latency
target_utilization: target GPU utilization (0-1)
"""
# Each replica can handle (1000 / avg_inference_time_ms) RPS
# at 100% utilization
capacity_per_replica = 1000 / avg_inference_time_ms
effective_capacity = capacity_per_replica * target_utilization
import math
replicas = math.ceil(current_rps / effective_capacity)
return max(
SCALING_CONFIG["min_replicas"],
min(replicas, SCALING_CONFIG["max_replicas"]),
    )

Cost Monitoring & Alerting
Cost optimization is not a one-time project -- it requires continuous monitoring. Without cost observability, optimizations erode over time as new features ship with unoptimized configurations, traffic patterns change, and pricing adjustments take effect. Build a cost monitoring layer that tracks spending by model, by feature, and by environment, with alerts that fire before costs exceed thresholds.
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from collections import defaultdict
import time
@dataclass
class CostAlert:
"""A triggered cost alert."""
alert_type: str
message: str
current_value: float
threshold: float
timestamp: float
class CostMonitor:
"""Track and alert on AI spending by feature and model.
Designed for integration with your observability stack.
Emit metrics to Prometheus/Datadog/CloudWatch as needed.
"""
def __init__(self):
self.hourly_costs: Dict[str, float] = defaultdict(float)
self.daily_costs: Dict[str, float] = defaultdict(float)
self.feature_costs: Dict[str, float] = defaultdict(float)
self.alerts: List[CostAlert] = []
# Configure thresholds per feature
self.hourly_thresholds: Dict[str, float] = {}
self.daily_thresholds: Dict[str, float] = {}
def record_cost(
self,
cost_usd: float,
model: str,
feature: str,
) -> Optional[CostAlert]:
"""Record a cost event and check thresholds."""
hour_key = f"{feature}:{int(time.time() // 3600)}"
day_key = f"{feature}:{int(time.time() // 86400)}"
self.hourly_costs[hour_key] += cost_usd
self.daily_costs[day_key] += cost_usd
self.feature_costs[feature] += cost_usd
# Check hourly threshold
hourly_threshold = self.hourly_thresholds.get(feature)
if hourly_threshold:
current_hourly = self.hourly_costs[hour_key]
if current_hourly > hourly_threshold:
alert = CostAlert(
alert_type="hourly_budget_exceeded",
message=(
f"Feature '{feature}' spent "
f"${current_hourly:.2f} this hour "
f"(threshold: ${hourly_threshold:.2f})"
),
current_value=current_hourly,
threshold=hourly_threshold,
timestamp=time.time(),
)
self.alerts.append(alert)
return alert
        # Check daily threshold (same pattern as hourly)
        daily_threshold = self.daily_thresholds.get(feature)
        if daily_threshold:
            current_daily = self.daily_costs[day_key]
            if current_daily > daily_threshold:
                alert = CostAlert(
                    alert_type="daily_budget_exceeded",
                    message=(
                        f"Feature '{feature}' spent "
                        f"${current_daily:.2f} today "
                        f"(threshold: ${daily_threshold:.2f})"
                    ),
                    current_value=current_daily,
                    threshold=daily_threshold,
                    timestamp=time.time(),
                )
                self.alerts.append(alert)
                return alert
        return None
def get_cost_summary(self) -> Dict[str, float]:
"""Get current cost summary by feature."""
return dict(self.feature_costs)
def set_threshold(
self,
feature: str,
hourly: Optional[float] = None,
daily: Optional[float] = None,
) -> None:
"""Set cost alert thresholds for a feature."""
if hourly is not None:
self.hourly_thresholds[feature] = hourly
if daily is not None:
            self.daily_thresholds[feature] = daily

Set up a weekly cost review ritual. Spend 15 minutes reviewing cost-per-feature trends, identifying any new features shipping without cost optimization, and checking cache hit rates. Small weekly reviews prevent large quarterly surprises.
Quick Wins Checklist
The following checklist covers optimizations ordered by effort-to-impact ratio. Start at the top and work your way down. Most teams can achieve meaningful cost reduction by completing just the first category.
Immediate Wins (This Week)
Short-Term (This Month)
Strategic (This Quarter)
Version History
1.0.0 · 2026-03-01
- Initial release covering token optimization, model routing, caching strategies, batch processing, infrastructure optimization, and cost monitoring
- Interactive cost calculator with three scale tiers
- Architecture diagram showing six primary cost centers
- Code examples for semantic caching, model routing, batch processing, and cost monitoring
- Quick wins checklist organized by implementation timeframe