Key Takeaway
By the end of this blueprint you will have a model routing layer that classifies incoming requests by complexity, routes simple tasks to smaller models, uses cascading with confidence checks for ambiguous requests, and provides fallback chains for provider outages — reducing LLM costs while maintaining quality.
Prerequisites
- An LLM gateway or direct access to multiple model providers (see LLM Gateway blueprint)
- Python 3.11+ for the routing logic
- Access to at least two model tiers (e.g., Claude Haiku and Claude Sonnet, or GPT-4o-mini and GPT-4o)
- An evaluation dataset to validate routing quality (see LLM Evaluation blueprint)
- Redis for routing decision caching and metrics
Routing Strategies
There are three primary routing strategies, each suited to different scenarios. Classification-based routing uses a lightweight model to score request complexity and routes based on the score. Cascading starts with the cheapest model and escalates only when confidence is low. Rules-based routing uses deterministic rules (request type, user tier, feature) to select models without an LLM call. In practice, production systems combine all three: rules handle known patterns, classification routes ambiguous requests, and cascading provides a safety net.
| Strategy | Latency Overhead | Cost Savings | Quality Risk | Best For |
|---|---|---|---|---|
| Classification | 50-200ms (classifier call) | 40-60% | Low with good classifier | High-volume mixed workloads |
| Cascading | None (cheap model first) | 30-50% | Very low (always escalates) | Quality-critical applications |
| Rules-based | None | 20-40% | None (deterministic) | Known task types, user tiers |
| Hybrid (all three) | 0-200ms | 50-70% | Lowest | Production systems at scale |
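Because rules-based routing is a plain lookup, it can be sketched in a few lines. The task types, user tiers, and model names below are hypothetical placeholders; adapt the table to whatever request metadata your system carries:

```python
"""Rules-based routing sketch: deterministic model selection without an LLM call."""
from __future__ import annotations

# Hypothetical (task_type, user_tier) -> model mapping for illustration
RULES: dict[tuple[str, str], str] = {
    ("extraction", "free"): "gpt-4o-mini",
    ("extraction", "pro"): "claude-haiku-4-5-20251001",
    ("code_review", "pro"): "claude-sonnet-4-20250514",
}


def route_by_rules(task_type: str, user_tier: str) -> str | None:
    """Return a model for a known pattern, or None to defer to the classifier."""
    return RULES.get((task_type, user_tier))
```

Unmatched requests fall through to classification, which is how the hybrid strategy composes the two: rules first, classifier for everything else.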
Complexity Classifier
The complexity classifier is a fast, cheap LLM call that reads the user request and estimates how difficult it is. It returns a complexity level (simple, moderate, complex) that the routing engine maps to a model tier. The classifier itself runs on the cheapest available model since its task — estimating difficulty — is inherently simple. The key is calibrating the classifier against your evaluation dataset to ensure it routes accurately for your specific domain.
"""Request complexity classifier for model routing."""
from __future__ import annotations
from enum import Enum
from typing import Literal
from anthropic import Anthropic
from pydantic import BaseModel, Field
client = Anthropic()
class Complexity(str, Enum):
SIMPLE = "simple"
MODERATE = "moderate"
COMPLEX = "complex"
class ClassificationResult(BaseModel):
complexity: Complexity
reasoning: str = Field(max_length=100)
estimated_tokens: int = Field(description="Estimated output tokens needed")
CLASSIFIER_PROMPT = """Classify this request's complexity for LLM routing.
SIMPLE: Factual questions, simple formatting, classification, extraction from provided text.
MODERATE: Multi-step reasoning, summarization with analysis, code with moderate logic.
COMPLEX: Creative writing, complex multi-step reasoning, code architecture, nuanced analysis.
Request: {request}
Return JSON with complexity, reasoning (brief), and estimated_tokens."""
async def classify_request(request: str) -> ClassificationResult:
"""Classify request complexity using the cheapest model.
This call adds ~50-100ms latency but saves 40-60% on model costs
by routing simple requests to cheaper models.
"""
response = client.messages.create(
model="claude-haiku-4-5-20251001", # Cheapest model
max_tokens=256,
messages=[{
"role": "user",
"content": CLASSIFIER_PROMPT.format(request=request),
}],
)
import json
data = json.loads(response.content[0].text)
return ClassificationResult(**data)Model Routing Engine
"""Multi-model routing engine with cascading and fallback."""
from __future__ import annotations
import logging
from dataclasses import dataclass
from typing import Any
from routing.classifier import Complexity, classify_request
logger = logging.getLogger(__name__)
@dataclass
class ModelConfig:
name: str
provider: str
input_cost_per_1k: float # USD per 1K input tokens
output_cost_per_1k: float # USD per 1K output tokens
max_tokens: int
latency_p50_ms: int # Typical latency
# Model tiers ordered by cost (cheapest first)
MODEL_TIERS: dict[Complexity, list[ModelConfig]] = {
Complexity.SIMPLE: [
ModelConfig("claude-haiku-4-5-20251001", "anthropic", 0.0008, 0.004, 4096, 300),
ModelConfig("gpt-4o-mini", "openai", 0.00015, 0.0006, 4096, 250),
],
Complexity.MODERATE: [
ModelConfig("claude-sonnet-4-20250514", "anthropic", 0.003, 0.015, 8192, 800),
ModelConfig("gpt-4o", "openai", 0.0025, 0.01, 8192, 700),
],
Complexity.COMPLEX: [
ModelConfig("claude-sonnet-4-20250514", "anthropic", 0.003, 0.015, 8192, 800),
ModelConfig("gpt-4o", "openai", 0.0025, 0.01, 8192, 700),
],
}
class ModelRouter:
"""Routes requests to optimal models based on complexity."""
def __init__(self, cascade_enabled: bool = True):
self.cascade_enabled = cascade_enabled
async def route(
self,
request: str,
constraints: dict[str, Any] | None = None,
) -> ModelConfig:
"""Select the optimal model for a request.
Args:
request: The user's request text.
constraints: Optional overrides (max_latency_ms, max_cost, model_tier).
Returns:
Selected ModelConfig.
"""
# Check for explicit overrides
if constraints and "model_tier" in constraints:
tier = Complexity(constraints["model_tier"])
else:
# Classify the request
classification = await classify_request(request)
tier = classification.complexity
logger.info(
"Classified as %s: %s",
tier.value, classification.reasoning,
)
# Get model list for this tier
models = MODEL_TIERS[tier]
# Apply latency constraint if specified
if constraints and "max_latency_ms" in constraints:
max_lat = constraints["max_latency_ms"]
models = [m for m in models if m.latency_p50_ms <= max_lat]
if not models:
# Fallback to cheapest available model
logger.warning("No models match constraints, falling back")
models = MODEL_TIERS[Complexity.SIMPLE]
# Return first available (cheapest for the tier)
return models[0]Cascading with Confidence Checks
Cascading starts with the cheapest model and only escalates when the output quality is insufficient. The confidence check examines the response for signals of low quality: hedging language ('I am not sure', 'it depends'), very short responses to complex questions, or low self-reported confidence when the model is prompted to rate its own answer. If the confidence check fails, the same request is sent to the next model tier with the original context. This adds latency for escalated requests but saves significantly on the majority of requests that the cheap model handles well.
Track your escalation rate as a key metric. If more than 30% of requests escalate from the cheap model, your classifier is under-routing or your confidence threshold is too aggressive. If fewer than 5% escalate, you might be over-routing to expensive models. Target a 10-20% escalation rate for most workloads.
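A minimal cascade loop might look like the sketch below. The `call_model` parameter is a placeholder for your gateway client, and the hedging-phrase and length heuristics are illustrative starting points, not a definitive confidence check:

```python
"""Cascading sketch: try the cheap model first, escalate on low-confidence output."""
from __future__ import annotations

from typing import Awaitable, Callable

# Illustrative hedging phrases; extend from your own evaluation data
HEDGES = ("i am not sure", "i'm not sure", "it depends")


def looks_low_confidence(response: str, min_length: int = 40) -> bool:
    """Heuristic check: hedging language or a suspiciously short answer."""
    text = response.strip().lower()
    return len(text) < min_length or any(h in text for h in HEDGES)


async def cascade(
    request: str,
    models: list[str],
    call_model: Callable[[str, str], Awaitable[str]],
) -> tuple[str, str]:
    """Walk the tier list cheapest-first; return (model_name, response)."""
    last = ""
    for i, model in enumerate(models):
        last = await call_model(model, request)
        # Accept the first confident answer; the final tier is always accepted.
        if not looks_low_confidence(last) or i == len(models) - 1:
            return model, last
    return models[-1], last
```

Logging which model ultimately answered gives you the escalation-rate metric directly: escalations are simply responses served by any model other than the first in the list.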
Fallback Chains for Reliability
Every model in your routing table should have a fallback. When the primary model returns a 5xx error, rate limit, or timeout, the router tries the next model in the tier's fallback list. Cross-provider fallbacks (Anthropic to OpenAI, or vice versa) protect against single-provider outages. Track fallback activations as an operational metric — a sustained fallback rate indicates a provider issue that needs investigation.
Cascading and fallback are different mechanisms that serve different purposes. Cascading escalates to a better model when quality is low; fallback switches to an alternate provider when the primary is unavailable. Never combine them in a way that cascades to a worse model — that defeats the purpose.
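A fallback chain can be sketched as a loop over the tier's model list that only advances on transient failures. The exception types in the retryable set are assumptions here; map them to the actual error classes your SDK or gateway raises:

```python
"""Fallback chain sketch: switch providers on transient errors only."""
from __future__ import annotations

from typing import Awaitable, Callable

# Assumed transient error types; substitute your SDK's rate-limit,
# timeout, and server-error exceptions
RETRYABLE = (TimeoutError, ConnectionError)


async def call_with_fallback(
    request: str,
    models: list[str],
    call_model: Callable[[str, str], Awaitable[str]],
) -> tuple[str, str]:
    """Try each model in order; raise only if every provider fails."""
    errors: list[tuple[str, Exception]] = []
    for model in models:
        try:
            return model, await call_model(model, request)
        except RETRYABLE as exc:
            # Record each fallback activation for operational metrics
            errors.append((model, exc))
    raise RuntimeError(f"All providers failed: {errors}")
```

Note the loop never reorders the list, so a fallback always moves sideways (alternate provider, same tier), keeping it distinct from cascading's upward escalation.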
Version History
1.0.0 · 2026-03-01
- Initial publication with classification-based, cascading, and rules-based routing
- Complexity classifier using Claude Haiku
- Multi-tier model routing engine with cost optimization
- Cascading confidence checks and fallback chain patterns
- Routing strategy comparison table