Key Takeaway
By the end of this blueprint you will have a model routing layer that classifies incoming requests by complexity, routes simple tasks to smaller models, uses cascading with confidence checks for ambiguous requests, and provides fallback chains for provider outages — reducing LLM costs while maintaining quality.
Prerequisites
- An LLM gateway or direct access to multiple model providers (see LLM Gateway blueprint)
- Python 3.11+ for the routing logic
- Access to at least two model tiers (e.g., Claude Haiku and Claude Sonnet, or GPT-4o-mini and GPT-4o)
- An evaluation dataset to validate routing quality (see LLM Evaluation blueprint)
- Redis for routing decision caching and metrics
Routing Strategies
There are three primary routing strategies, each suited to different scenarios. Classification-based routing uses a lightweight model to score request complexity and routes based on the score. Cascading starts with the cheapest model and escalates only when confidence is low. Rules-based routing uses deterministic rules (request type, user tier, feature) to select models without an LLM call. In practice, production systems combine all three: rules handle known patterns, classification routes ambiguous requests, and cascading provides a safety net.
| Strategy | Latency Overhead | Cost Savings | Quality Risk | Best For |
|---|---|---|---|---|
| Classification | 50-200ms (classifier call) | 40-60% | Low with good classifier | High-volume mixed workloads |
| Cascading | None (cheap model first) | 30-50% | Very low (always escalates) | Quality-critical applications |
| Rules-based | None | 20-40% | None (deterministic) | Known task types, user tiers |
| Hybrid (all three) | 0-200ms | 50-70% | Lowest | Production systems at scale |
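Because rules-based routing is a plain lookup, it can be sketched in a few lines. The task types, user tiers, and model names below are hypothetical placeholders; adapt the table to whatever request metadata your system carries:

```python
"""Rules-based routing sketch: deterministic model selection without an LLM call."""
from __future__ import annotations

# Hypothetical (task_type, user_tier) -> model mapping for illustration
RULES: dict[tuple[str, str], str] = {
    ("extraction", "free"): "gpt-4o-mini",
    ("extraction", "pro"): "claude-haiku-4-5-20251001",
    ("code_review", "pro"): "claude-sonnet-4-20250514",
}


def route_by_rules(task_type: str, user_tier: str) -> str | None:
    """Return a model for a known pattern, or None to defer to the classifier."""
    return RULES.get((task_type, user_tier))
```

Unmatched requests fall through to classification, which is how the hybrid strategy composes the two: rules first, classifier for everything else.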
Complexity Classifier
The complexity classifier is a fast, cheap LLM call that reads the user request and estimates how difficult it is. It returns a complexity level (simple, moderate, complex) that the routing engine maps to a model tier. The classifier itself runs on the cheapest available model since its task — estimating difficulty — is inherently simple. The key is calibrating the classifier against your evaluation dataset to ensure it routes accurately for your specific domain.
"""Request complexity classifier for model routing."""
from __future__ import annotations
from enum import Enum
from typing import Literal
from anthropic import Anthropic
from pydantic import BaseModel, Field
client = Anthropic()
class Complexity(str, Enum):
SIMPLE = "simple"
MODERATE = "moderate"
COMPLEX = "complex"
class ClassificationResult(BaseModel):
complexity: Complexity
reasoning: str = Field(max_length=100)
estimated_tokens: int = Field(description="Estimated output tokens needed")
CLASSIFIER_PROMPT = """Classify this request's complexity for LLM routing.
SIMPLE: Factual questions, simple formatting, classification, extraction from provided text.
MODERATE: Multi-step reasoning, summarization with analysis, code with moderate logic.
COMPLEX: Creative writing, complex multi-step reasoning, code architecture, nuanced analysis.
Request: {request}
Return JSON with complexity, reasoning (brief), and estimated_tokens."""
async def classify_request(request: str) -> ClassificationResult:
"""Classify request complexity using the cheapest model.
This call adds ~50-100ms latency but saves 40-60% on model costs
by routing simple requests to cheaper models.
"""
response = client.messages.create(
model="claude-haiku-4-5-20251001", # Cheapest model
max_tokens=256,
messages=[{
"role": "user",
"content": CLASSIFIER_PROMPT.format(request=request),
}],
)
import json
data = json.loads(response.content[0].text)
return ClassificationResult(**data)Model Routing Engine
"""Multi-model routing engine with cascading and fallback."""
from __future__ import annotations
import logging
from dataclasses import dataclass
from typing import Any
from routing.classifier import Complexity, classify_request
logger = logging.getLogger(__name__)
@dataclass
class ModelConfig:
name: str
provider: str
input_cost_per_1k: float # USD per 1K input tokens
output_cost_per_1k: float # USD per 1K output tokens
max_tokens: int
latency_p50_ms: int # Typical latency
# Model tiers ordered by cost (cheapest first)
MODEL_TIERS: dict[Complexity, list[ModelConfig]] = {
Complexity.SIMPLE: [
ModelConfig("claude-haiku-4-5-20251001", "anthropic", 0.0008, 0.004, 4096, 300),
ModelConfig("gpt-4o-mini", "openai", 0.00015, 0.0006, 4096, 250),
],
Complexity.MODERATE: [
ModelConfig("claude-sonnet-4-20250514", "anthropic", 0.003, 0.015, 8192, 800),
ModelConfig("gpt-4o", "openai", 0.0025, 0.01, 8192, 700),
],
Complexity.COMPLEX: [
ModelConfig("claude-sonnet-4-20250514", "anthropic", 0.003, 0.015, 8192, 800),
ModelConfig("gpt-4o", "openai", 0.0025, 0.01, 8192, 700),
],
}
class ModelRouter:
"""Routes requests to optimal models based on complexity."""
def __init__(self, cascade_enabled: bool = True):
self.cascade_enabled = cascade_enabled
async def route(
self,
request: str,
constraints: dict[str, Any] | None = None,
) -> ModelConfig:
"""Select the optimal model for a request.
Args:
request: The user's request text.
constraints: Optional overrides (max_latency_ms, max_cost, model_tier).
Returns:
Selected ModelConfig.
"""
# Check for explicit overrides
if constraints and "model_tier" in constraints:
tier = Complexity(constraints["model_tier"])
else:
# Classify the request
classification = await classify_request(request)
tier = classification.complexity
logger.info(
"Classified as %s: %s",
tier.value, classification.reasoning,
)
# Get model list for this tier
models = MODEL_TIERS[tier]
# Apply latency constraint if specified
if constraints and "max_latency_ms" in constraints:
max_lat = constraints["max_latency_ms"]
models = [m for m in models if m.latency_p50_ms <= max_lat]
if not models:
# Fallback to cheapest available model
logger.warning("No models match constraints, falling back")
models = MODEL_TIERS[Complexity.SIMPLE]
# Return first available (cheapest for the tier)
return models[0]Cascading with Confidence Checks
Cascading starts with the cheapest model and only escalates when the output quality is insufficient. The confidence check examines the response for signals of low quality: hedging language ('I am not sure', 'it depends'), very short responses to complex questions, or low self-reported confidence when the model is prompted to rate its own answer. If the confidence check fails, the same request is sent to the next model tier with the original context. This adds latency for escalated requests but saves significantly on the majority of requests that the cheap model handles well.
Track your escalation rate as a key metric. If more than 30% of requests escalate from the cheap model, your classifier is under-routing or your confidence threshold is too aggressive. If fewer than 5% escalate, you might be over-routing to expensive models. Target a 10-20% escalation rate for most workloads.
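A minimal cascade loop might look like the sketch below. The `call_model` parameter is a placeholder for your gateway client, and the hedging-phrase and length heuristics are illustrative starting points, not a definitive confidence check:

```python
"""Cascading sketch: try the cheap model first, escalate on low-confidence output."""
from __future__ import annotations

from typing import Awaitable, Callable

# Illustrative hedging phrases; extend from your own evaluation data
HEDGES = ("i am not sure", "i'm not sure", "it depends")


def looks_low_confidence(response: str, min_length: int = 40) -> bool:
    """Heuristic check: hedging language or a suspiciously short answer."""
    text = response.strip().lower()
    return len(text) < min_length or any(h in text for h in HEDGES)


async def cascade(
    request: str,
    models: list[str],
    call_model: Callable[[str, str], Awaitable[str]],
) -> tuple[str, str]:
    """Walk the tier list cheapest-first; return (model_name, response)."""
    last = ""
    for i, model in enumerate(models):
        last = await call_model(model, request)
        # Accept the first confident answer; the final tier is always accepted.
        if not looks_low_confidence(last) or i == len(models) - 1:
            return model, last
    return models[-1], last
```

Logging which model ultimately answered gives you the escalation-rate metric directly: escalations are simply responses served by any model other than the first in the list.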
Fallback Chains for Reliability
Every model in your routing table should have a fallback. When the primary model returns a 5xx error, rate limit, or timeout, the router tries the next model in the tier's fallback list. Cross-provider fallbacks (Anthropic to OpenAI, or vice versa) protect against single-provider outages. Track fallback activations as an operational metric — a sustained fallback rate indicates a provider issue that needs investigation.
Cascading and fallback are different mechanisms that serve different purposes. Cascading escalates to a better model when quality is low; fallback switches to an alternate provider when the primary is unavailable. Never combine them in a way that cascades to a worse model — that defeats the purpose.
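A fallback chain can be sketched as a loop over the tier's model list that only advances on transient failures. The exception types in the retryable set are assumptions here; map them to the actual error classes your SDK or gateway raises:

```python
"""Fallback chain sketch: switch providers on transient errors only."""
from __future__ import annotations

from typing import Awaitable, Callable

# Assumed transient error types; substitute your SDK's rate-limit,
# timeout, and server-error exceptions
RETRYABLE = (TimeoutError, ConnectionError)


async def call_with_fallback(
    request: str,
    models: list[str],
    call_model: Callable[[str, str], Awaitable[str]],
) -> tuple[str, str]:
    """Try each model in order; raise only if every provider fails."""
    errors: list[tuple[str, Exception]] = []
    for model in models:
        try:
            return model, await call_model(model, request)
        except RETRYABLE as exc:
            # Record each fallback activation for operational metrics
            errors.append((model, exc))
    raise RuntimeError(f"All providers failed: {errors}")
```

Note the loop never reorders the list, so a fallback always moves sideways (alternate provider, same tier), keeping it distinct from cascading's upward escalation.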
Version History
1.0.0 · 2026-03-01
- Initial publication with classification-based, cascading, and rules-based routing
- Complexity classifier using Claude Haiku
- Multi-tier model routing engine with cost optimization
- Cascading confidence checks and fallback chain patterns
- Routing strategy comparison table