Key Takeaway
By the end of this blueprint you will have a production LLM gateway built on LiteLLM that provides a unified OpenAI-compatible API across Anthropic, OpenAI, and open-source providers, with per-team rate limiting, budget enforcement, automatic failover, and structured telemetry for every request.
Prerequisites
- Python 3.11+ or Docker for running LiteLLM Proxy
- PostgreSQL or Redis for rate limit state and usage tracking
- API keys for at least two LLM providers (Anthropic, OpenAI, or self-hosted)
- Basic understanding of reverse proxies and HTTP middleware patterns
- A monitoring stack (Prometheus + Grafana or equivalent) for dashboarding
Why a Centralized LLM Gateway?
Without a gateway, every team manages their own API keys, rate limits, and provider integrations. This creates three problems that compound as you scale: cost visibility disappears because spend is scattered across dozens of API keys with no central attribution; security weakens because API keys are embedded in application configs across repositories; and reliability suffers because each application must implement its own retry and failover logic. A gateway centralizes all of this into a single layer that platform teams operate.
Architecture Overview
The gateway sits as a reverse proxy between application teams and LLM providers. Incoming requests pass through an authentication layer, a rate-limit and budget-enforcement layer, and a routing layer that selects the optimal provider based on model requirements, latency, and cost. Response streams are proxied back with injected telemetry headers for downstream observability.
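The provider-selection step in that routing layer can be sketched as a small scoring function. This is a minimal illustration under assumed inputs (the `Provider` fields and the latency ceiling are hypothetical), not LiteLLM's actual router:

```python
from dataclasses import dataclass


@dataclass
class Provider:
    """Routing-table entry for one upstream provider (illustrative fields)."""
    name: str
    healthy: bool
    p95_latency_ms: float
    cost_per_1k_tokens_usd: float


def pick_provider(
    providers: list[Provider], max_latency_ms: float = 2000.0
) -> Provider:
    """Pick the cheapest healthy provider under the latency ceiling."""
    candidates = [
        p for p in providers if p.healthy and p.p95_latency_ms <= max_latency_ms
    ]
    if not candidates:
        raise RuntimeError("no healthy provider available")
    return min(candidates, key=lambda p: p.cost_per_1k_tokens_usd)
```

A real router would also weigh model capability requirements and current load, but the shape is the same: filter to viable candidates, then rank by cost.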
Setting Up LiteLLM Proxy
LiteLLM is one of the most mature open-source LLM proxies. It normalizes 100+ provider APIs into the OpenAI chat completions format, handles streaming, and provides built-in support for rate limiting, budget tracking, and model aliasing. We run it as a Docker container behind a load balancer, with a PostgreSQL backend for persistent configuration and usage tracking.
```yaml
# LiteLLM Proxy configuration (litellm_config.yaml)
model_list:
  # Primary: Anthropic Claude
  - model_name: "claude-sonnet"
    litellm_params:
      model: "anthropic/claude-sonnet-4-20250514"
      api_key: "os.environ/ANTHROPIC_API_KEY"
      max_tokens: 8192
      timeout: 120
    model_info:
      input_cost_per_token: 0.000003
      output_cost_per_token: 0.000015
  # Fallback: OpenAI GPT-4o
  - model_name: "claude-sonnet"  # Same name = automatic fallback
    litellm_params:
      model: "openai/gpt-4o"
      api_key: "os.environ/OPENAI_API_KEY"
      max_tokens: 8192
      timeout: 120
    model_info:
      input_cost_per_token: 0.0000025
      output_cost_per_token: 0.00001
  # Fast model for classification
  - model_name: "fast"
    litellm_params:
      model: "anthropic/claude-haiku-4-5-20251001"
      api_key: "os.environ/ANTHROPIC_API_KEY"
      max_tokens: 4096
      timeout: 30

litellm_settings:
  drop_params: true  # Drop unsupported params per provider
  set_verbose: false
  cache: true
  cache_params:
    type: "redis"
    host: "os.environ/REDIS_HOST"
    port: 6379
    ttl: 3600

general_settings:
  master_key: "os.environ/LITELLM_MASTER_KEY"
  database_url: "os.environ/DATABASE_URL"
  alerting:
    - "slack"
  alert_types:
    - "llm_exceptions"
    - "budget_alerts"
```

A Docker Compose stack runs the proxy alongside its PostgreSQL and Redis backends:

```yaml
# docker-compose.yaml
version: "3.9"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - DATABASE_URL=postgresql://litellm:password@postgres:5432/litellm
      - REDIS_HOST=redis
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on:
      - postgres
      - redis
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: password
    volumes:
      - pgdata:/var/lib/postgresql/data
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
volumes:
  pgdata:
```

Per-Team Rate Limiting and Budgets
LiteLLM supports virtual keys that map to teams or projects. Each key gets its own rate limit (requests per minute) and budget ceiling (maximum spend per month). When a team hits their budget, the gateway returns a 429 with a clear message rather than silently draining the organization's API credits. This is the single most important governance feature — without it, a runaway batch job in one team can exhaust the entire organization's API budget in hours.
"""Create teams and virtual keys via the LiteLLM management API."""
import httpx
GATEWAY_URL = "http://localhost:4000"
MASTER_KEY = "sk-master-..."
async def create_team_and_key(
team_name: str,
monthly_budget_usd: float,
rpm_limit: int = 100,
models: list[str] | None = None,
) -> dict:
"""Provision a team with budget limits and generate an API key.
Args:
team_name: Human-readable team identifier.
monthly_budget_usd: Maximum monthly spend in USD.
rpm_limit: Requests per minute limit.
models: Allowed model names (None = all models).
Returns:
Dict with team_id and api_key for the team.
"""
async with httpx.AsyncClient() as client:
# Create team
team_resp = await client.post(
f"{GATEWAY_URL}/team/new",
headers={"Authorization": f"Bearer {MASTER_KEY}"},
json={
"team_alias": team_name,
"max_budget": monthly_budget_usd,
"rpm_limit": rpm_limit,
"models": models or [],
"budget_duration": "1mo",
},
)
team_data = team_resp.json()
# Generate key for the team
key_resp = await client.post(
f"{GATEWAY_URL}/key/generate",
headers={"Authorization": f"Bearer {MASTER_KEY}"},
json={
"team_id": team_data["team_id"],
"key_alias": f"{team_name}-production",
"max_budget": monthly_budget_usd,
"rpm_limit": rpm_limit,
},
)
key_data = key_resp.json()
return {
"team_id": team_data["team_id"],
"api_key": key_data["key"],
}Set budget alerts at 80% of ceiling so teams get a warning before they hit their limit. LiteLLM supports Slack and email alerting natively. Configure alert_types to include 'budget_alerts' in your config.
Provider Failover and Health Checks
The gateway should automatically fail over to a backup provider when the primary returns errors or exceeds latency thresholds. LiteLLM handles this by trying providers in order when multiple models share the same model_name. For finer control, implement a health check loop that probes each provider with a lightweight completion request every 30 seconds. Mark unhealthy providers as unavailable in the routing table until they recover.
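One way to sketch that health check loop, with the probe injected so the marking policy is testable; the three-consecutive-failures threshold here is an assumed value, not a LiteLLM setting:

```python
import time
from collections.abc import Callable


class ProviderHealth:
    """Track provider availability from periodic lightweight probes."""

    def __init__(self, probe: Callable[[str], bool], fail_threshold: int = 3):
        self.probe = probe  # returns True if a tiny completion succeeds
        self.fail_threshold = fail_threshold
        self.failures: dict[str, int] = {}

    def check(self, provider: str) -> None:
        """Run one probe and update the consecutive-failure counter."""
        try:
            ok = self.probe(provider)
        except Exception:
            ok = False
        self.failures[provider] = 0 if ok else self.failures.get(provider, 0) + 1

    def is_available(self, provider: str) -> bool:
        """A provider is routable until it fails fail_threshold probes in a row."""
        return self.failures.get(provider, 0) < self.fail_threshold


def run_health_loop(
    checker: ProviderHealth, providers: list[str], interval_s: float = 30.0
) -> None:
    """Probe every provider on a fixed cadence (run in a daemon thread)."""
    while True:
        for name in providers:
            checker.check(name)
        time.sleep(interval_s)
```

Because a successful probe resets the counter, a provider recovers automatically as soon as it answers again, which matches the "unavailable until they recover" behavior described above.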
Gateway Feature Comparison
| Feature | LiteLLM | Custom Proxy | Cloud API Gateway |
|---|---|---|---|
| Multi-provider normalization | Built-in (100+ providers) | Manual implementation | Not applicable |
| Per-team budgets | Built-in | Custom middleware | Basic rate limiting only |
| Streaming support | Full SSE passthrough | Must implement | Varies by provider |
| Cost tracking | Automatic per-request | Custom instrumentation | Not available |
| Failover | Model-name based | Custom logic | Health check routing |
| Setup time | Hours | Weeks | Days |
Telemetry and Cost Attribution
Every request through the gateway generates a telemetry event containing: the team and key that made the request, the model and provider used, input and output token counts, latency at each middleware stage, the computed cost, and any errors. These events flow into your observability pipeline (see the AI Observability Stack blueprint) for dashboarding. The cost attribution data is what makes budget conversations with engineering teams productive — you can show exactly which features, endpoints, and user segments drive LLM spend.
Never log prompt content in telemetry by default. Prompts often contain user data or PII. Log metadata (token counts, model, latency) for every request, and only log prompt content when explicitly enabled for debugging, with automatic rotation and access controls.
Version History
1.0.0 · 2026-03-01
- Initial publication with LiteLLM Proxy setup and Docker Compose deployment
- Per-team virtual keys with budget and rate limit enforcement
- Provider failover configuration and health check patterns
- Cost attribution telemetry and gateway feature comparison