Key Takeaway
Separating inference auto-scaling from training job scheduling prevents resource contention and allows each workload type to scale according to its own performance characteristics. This guide covers four scaling patterns with architecture diagrams, infrastructure-as-code examples, and cost modeling for each.
Prerequisites
- At least one ML model in production or a clear deployment timeline
- Kubernetes cluster or managed container orchestration platform
- GPU instance access from at least one cloud provider (AWS, GCP, Azure)
- Understanding of your model's resource requirements (GPU memory, compute, storage)
- Traffic pattern data or estimates for capacity planning
AI Scaling Is Different
AI workloads have fundamentally different scaling characteristics than traditional web applications. A web server handles requests in milliseconds with negligible marginal compute cost. A model inference endpoint may take seconds per request, consume gigabytes of GPU memory per replica, and cost orders of magnitude more per request. This means that scaling decisions have a much larger cost impact, over-provisioning is extremely expensive, and under-provisioning causes cascading latency failures rather than graceful degradation.
The other critical difference is heterogeneity. A traditional web application scales by adding identical replicas behind a load balancer. AI infrastructure must support multiple model types (each with different resource requirements), mixed workload priorities (latency-sensitive inference vs. throughput-optimized training), and different scaling behaviors (inference scales with request volume, training scales with data volume and desired iteration speed). A single scaling policy cannot handle this diversity.
Pattern 1: Horizontal Inference Scaling
Horizontal scaling adds model replicas behind a load balancer to handle increased request volume. Each replica loads the same model into GPU memory and serves requests independently. This is the simplest and most common scaling pattern, appropriate for models that fit on a single GPU and serve latency-sensitive traffic. The key decision is the scaling metric: scale on GPU utilization for compute-bound models, on request queue depth for I/O-bound models, or on latency percentiles for SLA-sensitive endpoints.
# Kubernetes HPA for inference scaling
# Scales based on GPU utilization and request queue depth
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-inference-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-inference
  minReplicas: 2   # Minimum for HA
  maxReplicas: 20  # Budget cap
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # React fast
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Scale down slowly
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
  metrics:
    # Primary: GPU utilization
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"  # Scale up above 70%
    # Secondary: Request queue depth
    - type: Pods
      pods:
        metric:
          name: pending_requests
        target:
          type: AverageValue
          averageValue: "5"   # Max 5 pending per replica

Pattern 2: Queue-Based Scaling
Queue-based scaling decouples request ingestion from processing, absorbing traffic bursts without dropping requests or adding expensive replicas for peak capacity. Requests are placed in a priority queue and consumed by a pool of GPU workers. The worker pool scales based on queue depth rather than request rate. This pattern is ideal for workloads that can tolerate latency measured in seconds rather than milliseconds, such as document processing, batch classification, and async content generation.
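The scaling decision itself reduces to sizing the worker pool from the backlog. A minimal sketch of that policy, assuming a hypothetical `desired_workers` function fed by your queue's depth metric and a measured per-worker throughput; the drain-time target and pool bounds are illustrative:

```python
import math

def desired_workers(queue_depth: int,
                    per_worker_throughput: float,
                    target_drain_seconds: float,
                    min_workers: int = 1,
                    max_workers: int = 16) -> int:
    """Size the GPU worker pool so the current backlog drains
    within the target window, clamped to the pool's bounds."""
    capacity_per_worker = per_worker_throughput * target_drain_seconds
    needed = math.ceil(queue_depth / capacity_per_worker)
    return max(min_workers, min(max_workers, needed))

# 900 queued jobs, 2 jobs/sec per worker, drain within 60s -> 8 workers
print(desired_workers(900, 2.0, 60.0))
```

Because the policy targets drain time rather than raw queue depth, it naturally scales harder when the backlog is large relative to worker throughput, and idles at the minimum when the queue is empty.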
Pattern 3: Multi-Model Serving
Multi-model serving runs multiple models on shared GPU infrastructure, improving utilization by packing models that have complementary traffic patterns onto the same GPU fleet. The key challenge is GPU memory management: each model consumes a fixed amount of GPU memory when loaded, and loading/unloading models is slow (seconds to minutes). Effective multi-model serving requires a model loading policy that keeps frequently-accessed models in GPU memory and evicts rarely-used models to CPU memory or disk.
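One way to implement such a loading policy is LRU eviction under a GPU memory budget. A minimal sketch; `load_fn` and `unload_fn` are hypothetical callbacks standing in for your serving runtime's actual load/unload hooks:

```python
from collections import OrderedDict

class ModelCache:
    """Keeps models resident in GPU memory under a fixed budget,
    evicting the least-recently-used model when space runs out."""

    def __init__(self, budget_gb: float, load_fn, unload_fn):
        self.budget_gb = budget_gb
        self.load_fn = load_fn      # moves a model into GPU memory
        self.unload_fn = unload_fn  # moves a model to CPU memory / disk
        self.resident = OrderedDict()  # model name -> size in GB, LRU order

    def acquire(self, name: str, size_gb: float) -> None:
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as recently used
            return
        # Evict least-recently-used models until the new one fits.
        while self.resident and sum(self.resident.values()) + size_gb > self.budget_gb:
            victim, _ = self.resident.popitem(last=False)
            self.unload_fn(victim)
        self.load_fn(name)
        self.resident[name] = size_gb
```

Because loads are slow, a production version would also pin models with strict latency SLAs so they can never be evicted, and prefetch models whose traffic is predictable.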
Pattern 4: Multi-Region Deployment
Multi-region deployment serves model inference from multiple geographic locations for three reasons: latency reduction (serve users from the nearest region), availability (survive a regional outage), and data residency (keep data processing within regulatory boundaries). The cost of multi-region is significant because you need GPU capacity in each region. The trade-off is justified for latency-sensitive applications with a global user base or for applications with strict data residency requirements.
GPU spot instance availability varies dramatically by region and instance type. Build your capacity planning around the worst case for each region, and maintain on-demand fallback capacity for critical workloads. Running a multi-region AI deployment entirely on spot instances is a recipe for correlated availability failures.
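That worst-case planning can be made concrete: cover the critical baseline with on-demand capacity, and oversubscribe spot requests against the worst fulfillment rate you have observed in that region. A rough sketch; the fractions and availability figures below are illustrative assumptions, not benchmarks:

```python
import math

def capacity_mix(peak_gpus: int,
                 critical_fraction: float,
                 worst_spot_availability: float) -> dict:
    """Per-region GPU plan: on-demand covers the critical baseline,
    spot covers the remainder, oversubscribed for reclamation risk."""
    on_demand = math.ceil(peak_gpus * critical_fraction)
    spot_needed = peak_gpus - on_demand
    spot_requested = (math.ceil(spot_needed / worst_spot_availability)
                      if spot_needed > 0 else 0)
    return {"on_demand": on_demand, "spot_requested": spot_requested}

# 40-GPU peak, half of it critical, worst-case 80% spot fulfillment
print(capacity_mix(40, 0.5, 0.8))  # on_demand: 20, spot_requested: 25
```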
Capacity Planning
AI capacity planning must account for GPU memory constraints (each model requires a fixed amount of GPU memory regardless of load), cold start times (loading a model into GPU memory can take 30 seconds to several minutes), and the non-linear relationship between replica count and latency. Adding replicas reduces average latency but does not reduce worst-case latency caused by individual slow inferences. Plan for headroom: target 60-70% GPU utilization to absorb traffic spikes without latency degradation.
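The headroom guidance translates directly into a replica-count estimate. A back-of-the-envelope sketch, assuming you know peak request rate and a single replica's saturated throughput; the 65% target sits in the 60-70% utilization band above:

```python
import math

def replicas_needed(peak_qps: float,
                    per_replica_qps: float,
                    target_utilization: float = 0.65) -> int:
    """Replicas required to serve peak traffic while keeping each
    replica at the target utilization, leaving headroom for spikes."""
    return math.ceil(peak_qps / (per_replica_qps * target_utilization))

# 120 QPS peak, 10 QPS per saturated replica, 65% target -> 19 replicas
print(replicas_needed(120, 10))
```

Remember to add cold start time on top of this: if loading a model takes two minutes, the autoscaler must begin scaling well before the spike arrives, or the headroom must absorb the entire ramp.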
Version History
1.0.0 · 2026-03-01
- Initial release with four scaling patterns: horizontal, queue-based, multi-model, and multi-region
- Kubernetes HPA configuration example for GPU-based auto-scaling
- Multi-region architecture diagram with geo-routing and failover
- Capacity planning guidance for GPU memory and cold start constraints
- Scaling readiness checklist