Key Takeaway
By the end of this blueprint you will have a production LLM gateway built on LiteLLM that provides a unified OpenAI-compatible API across Anthropic, OpenAI, and open-source providers, with per-team rate limiting, budget enforcement, automatic failover, and structured telemetry for every request.
Prerequisites
- Python 3.11+ or Docker for running LiteLLM Proxy
- PostgreSQL or Redis for rate limit state and usage tracking
- API keys for at least two LLM providers (Anthropic, OpenAI, or self-hosted)
- Basic understanding of reverse proxies and HTTP middleware patterns
- A monitoring stack (Prometheus + Grafana or equivalent) for dashboarding
Why a Centralized LLM Gateway?
Without a gateway, every team manages their own API keys, rate limits, and provider integrations. This creates three problems that compound as you scale: cost visibility disappears because spend is scattered across dozens of API keys with no central attribution; security weakens because API keys are embedded in application configs across repositories; and reliability suffers because each application must implement its own retry and failover logic. A gateway centralizes all of this into a single layer that platform teams operate.
Architecture Overview
The gateway sits as a reverse proxy between application teams and LLM providers. Incoming requests pass through an authentication layer, a rate-limit and budget-enforcement layer, and a routing layer that selects the optimal provider based on model requirements, latency, and cost. Response streams are proxied back with injected telemetry headers for downstream observability.
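The provider-selection step in that routing layer can be sketched as a small scoring function. This is a minimal illustration under assumed inputs (the `Provider` fields and the latency ceiling are hypothetical), not LiteLLM's actual router:

```python
from dataclasses import dataclass


@dataclass
class Provider:
    """Routing-table entry for one upstream provider (illustrative fields)."""
    name: str
    healthy: bool
    p95_latency_ms: float
    cost_per_1k_tokens_usd: float


def pick_provider(
    providers: list[Provider], max_latency_ms: float = 2000.0
) -> Provider:
    """Pick the cheapest healthy provider under the latency ceiling."""
    candidates = [
        p for p in providers if p.healthy and p.p95_latency_ms <= max_latency_ms
    ]
    if not candidates:
        raise RuntimeError("no healthy provider available")
    return min(candidates, key=lambda p: p.cost_per_1k_tokens_usd)
```

A real router would also weigh model capability requirements and current load, but the shape is the same: filter to viable candidates, then rank by cost.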
Setting Up LiteLLM Proxy
LiteLLM is one of the most mature open-source LLM proxies. It normalizes 100+ provider APIs into the OpenAI chat completions format, handles streaming, and provides built-in support for rate limiting, budget tracking, and model aliasing. We run it as a Docker container behind a load balancer, with a PostgreSQL backend for persistent configuration and usage tracking.
```yaml
# LiteLLM Proxy configuration (litellm_config.yaml)
model_list:
  # Primary: Anthropic Claude
  - model_name: "claude-sonnet"
    litellm_params:
      model: "anthropic/claude-sonnet-4-20250514"
      api_key: "os.environ/ANTHROPIC_API_KEY"
      max_tokens: 8192
      timeout: 120
    model_info:
      input_cost_per_token: 0.000003
      output_cost_per_token: 0.000015
  # Fallback: OpenAI GPT-4o
  - model_name: "claude-sonnet"  # Same name = automatic fallback
    litellm_params:
      model: "openai/gpt-4o"
      api_key: "os.environ/OPENAI_API_KEY"
      max_tokens: 8192
      timeout: 120
    model_info:
      input_cost_per_token: 0.0000025
      output_cost_per_token: 0.00001
  # Fast model for classification
  - model_name: "fast"
    litellm_params:
      model: "anthropic/claude-haiku-4-5-20251001"
      api_key: "os.environ/ANTHROPIC_API_KEY"
      max_tokens: 4096
      timeout: 30

litellm_settings:
  drop_params: true  # Drop unsupported params per provider
  set_verbose: false
  cache: true
  cache_params:
    type: "redis"
    host: "os.environ/REDIS_HOST"
    port: 6379
    ttl: 3600

general_settings:
  master_key: "os.environ/LITELLM_MASTER_KEY"
  database_url: "os.environ/DATABASE_URL"
  alerting:
    - "slack"
  alert_types:
    - "llm_exceptions"
    - "budget_alerts"
```

A Docker Compose stack runs the proxy alongside its PostgreSQL and Redis backends:

```yaml
# docker-compose.yaml
version: "3.9"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
      - REDIS_HOST=redis
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on:
      - postgres
      - redis
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - pgdata:/var/lib/postgresql/data
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
volumes:
  pgdata:
```

Per-Team Rate Limiting and Budgets
LiteLLM supports virtual keys that map to teams or projects. Each key gets its own rate limit (requests per minute) and budget ceiling (maximum spend per month). When a team hits their budget, the gateway returns a 429 with a clear message rather than silently draining the organization's API credits. This is the single most important governance feature — without it, a runaway batch job in one team can exhaust the entire organization's API budget in hours.
"""Create teams and virtual keys via the LiteLLM management API."""
import httpx
GATEWAY_URL = "http://localhost:4000"
MASTER_KEY = "sk-master-..."
async def create_team_and_key(
team_name: str,
monthly_budget_usd: float,
rpm_limit: int = 100,
models: list[str] | None = None,
) -> dict:
"""Provision a team with budget limits and generate an API key.
Args:
team_name: Human-readable team identifier.
monthly_budget_usd: Maximum monthly spend in USD.
rpm_limit: Requests per minute limit.
models: Allowed model names (None = all models).
Returns:
Dict with team_id and api_key for the team.
"""
async with httpx.AsyncClient() as client:
# Create team
team_resp = await client.post(
f"{GATEWAY_URL}/team/new",
headers={"Authorization": f"Bearer {MASTER_KEY}"},
json={
"team_alias": team_name,
"max_budget": monthly_budget_usd,
"rpm_limit": rpm_limit,
"models": models or [],
"budget_duration": "1mo",
},
)
team_data = team_resp.json()
# Generate key for the team
key_resp = await client.post(
f"{GATEWAY_URL}/key/generate",
headers={"Authorization": f"Bearer {MASTER_KEY}"},
json={
"team_id": team_data["team_id"],
"key_alias": f"{team_name}-production",
"max_budget": monthly_budget_usd,
"rpm_limit": rpm_limit,
},
)
key_data = key_resp.json()
return {
"team_id": team_data["team_id"],
"api_key": key_data["key"],
}Set budget alerts at 80% of ceiling so teams get a warning before they hit their limit. LiteLLM supports Slack and email alerting natively. Configure alert_types to include 'budget_alerts' in your config.
Provider Failover and Health Checks
The gateway should automatically fail over to a backup provider when the primary returns errors or exceeds latency thresholds. LiteLLM handles this by trying providers in order when multiple models share the same model_name. For finer control, implement a health check loop that probes each provider with a lightweight completion request every 30 seconds. Mark unhealthy providers as unavailable in the routing table until they recover.
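One way to sketch that health check loop, with the probe injected so the marking policy is testable; the three-consecutive-failures threshold here is an assumed value, not a LiteLLM setting:

```python
import time
from collections.abc import Callable


class ProviderHealth:
    """Track provider availability from periodic lightweight probes."""

    def __init__(self, probe: Callable[[str], bool], fail_threshold: int = 3):
        self.probe = probe  # returns True if a tiny completion succeeds
        self.fail_threshold = fail_threshold
        self.failures: dict[str, int] = {}

    def check(self, provider: str) -> None:
        """Run one probe and update the consecutive-failure counter."""
        try:
            ok = self.probe(provider)
        except Exception:
            ok = False
        self.failures[provider] = 0 if ok else self.failures.get(provider, 0) + 1

    def is_available(self, provider: str) -> bool:
        """A provider is routable until it fails fail_threshold probes in a row."""
        return self.failures.get(provider, 0) < self.fail_threshold


def run_health_loop(
    checker: ProviderHealth, providers: list[str], interval_s: float = 30.0
) -> None:
    """Probe every provider on a fixed cadence (run in a daemon thread)."""
    while True:
        for name in providers:
            checker.check(name)
        time.sleep(interval_s)
```

Because a successful probe resets the counter, a provider recovers automatically as soon as it answers again, which matches the "unavailable until they recover" behavior described above.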
Gateway Feature Comparison
| Feature | LiteLLM | Custom Proxy | Cloud API Gateway |
|---|---|---|---|
| Multi-provider normalization | Built-in (100+ providers) | Manual implementation | Not applicable |
| Per-team budgets | Built-in | Custom middleware | Basic rate limiting only |
| Streaming support | Full SSE passthrough | Must implement | Varies by provider |
| Cost tracking | Automatic per-request | Custom instrumentation | Not available |
| Failover | Model-name based | Custom logic | Health check routing |
| Setup time | Hours | Weeks | Days |
Telemetry and Cost Attribution
Every request through the gateway generates a telemetry event containing: the team and key that made the request, the model and provider used, input and output token counts, latency at each middleware stage, the computed cost, and any errors. These events flow into your observability pipeline (see the AI Observability Stack blueprint) for dashboarding. The cost attribution data is what makes budget conversations with engineering teams productive — you can show exactly which features, endpoints, and user segments drive LLM spend.
Never log prompt content in telemetry by default. Prompts often contain user data or PII. Log metadata (token counts, model, latency) for every request, and only log prompt content when explicitly enabled for debugging, with automatic rotation and access controls.
Version History
1.0.0 · 2026-03-01
- Initial publication with LiteLLM Proxy setup and Docker Compose deployment
- Per-team virtual keys with budget and rate limit enforcement
- Provider failover configuration and health check patterns
- Cost attribution telemetry and gateway feature comparison