Key Takeaway
Separating inference auto-scaling from training job scheduling prevents resource contention and allows each workload type to scale according to its own performance characteristics. This guide covers four scaling patterns with architecture diagrams, infrastructure-as-code examples, and cost modeling for each.
Prerequisites
- At least one ML model in production or a clear deployment timeline
- Kubernetes cluster or managed container orchestration platform
- GPU instance access from at least one cloud provider (AWS, GCP, Azure)
- Understanding of your model's resource requirements (GPU memory, compute, storage)
- Traffic pattern data or estimates for capacity planning
AI Scaling Is Different
AI workloads have fundamentally different scaling characteristics than traditional web applications. A web server handles requests in milliseconds with negligible marginal compute cost. A model inference endpoint may take seconds per request, consume gigabytes of GPU memory per replica, and cost orders of magnitude more per request. This means that scaling decisions have a much larger cost impact, over-provisioning is extremely expensive, and under-provisioning causes cascading latency failures rather than graceful degradation.
The other critical difference is heterogeneity. A traditional web application scales by adding identical replicas behind a load balancer. AI infrastructure must support multiple model types (each with different resource requirements), mixed workload priorities (latency-sensitive inference vs. throughput-optimized training), and different scaling behaviors (inference scales with request volume, training scales with data volume and desired iteration speed). A single scaling policy cannot handle this diversity.
Pattern 1: Horizontal Inference Scaling
Horizontal scaling adds model replicas behind a load balancer to handle increased request volume. Each replica loads the same model into GPU memory and serves requests independently. This is the simplest and most common scaling pattern, appropriate for models that fit on a single GPU and serve latency-sensitive traffic. The key decision is the scaling metric: scale on GPU utilization for compute-bound models, on request queue depth for I/O-bound models, or on latency percentiles for SLA-sensitive endpoints.
# Kubernetes HPA for inference scaling
# Scales based on GPU utilization and request queue depth
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-inference-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-inference
  minReplicas: 2   # Minimum for HA
  maxReplicas: 20  # Budget cap
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # React fast
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Scale down slowly
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
  metrics:
    # Primary: GPU utilization
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"  # Scale up above 70%
    # Secondary: Request queue depth
    - type: Pods
      pods:
        metric:
          name: pending_requests
        target:
          type: AverageValue
          averageValue: "5"   # Max 5 pending per replica

Pattern 2: Queue-Based Scaling
Queue-based scaling decouples request ingestion from processing, absorbing traffic bursts without dropping requests or adding expensive replicas for peak capacity. Requests are placed in a priority queue and consumed by a pool of GPU workers. The worker pool scales based on queue depth rather than request rate. This pattern is ideal for workloads that can tolerate latency measured in seconds rather than milliseconds, such as document processing, batch classification, and async content generation.
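The scaling decision itself reduces to sizing the worker pool from the backlog. A minimal sketch of that policy, assuming a hypothetical `desired_workers` function fed by your queue's depth metric and a measured per-worker throughput; the drain-time target and pool bounds are illustrative:

```python
import math

def desired_workers(queue_depth: int,
                    per_worker_throughput: float,
                    target_drain_seconds: float,
                    min_workers: int = 1,
                    max_workers: int = 16) -> int:
    """Size the GPU worker pool so the current backlog drains
    within the target window, clamped to the pool's bounds."""
    capacity_per_worker = per_worker_throughput * target_drain_seconds
    needed = math.ceil(queue_depth / capacity_per_worker)
    return max(min_workers, min(max_workers, needed))

# 900 queued jobs, 2 jobs/sec per worker, drain within 60s -> 8 workers
print(desired_workers(900, 2.0, 60.0))
```

Because the policy targets drain time rather than raw queue depth, it naturally scales harder when the backlog is large relative to worker throughput, and idles at the minimum when the queue is empty.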
Pattern 3: Multi-Model Serving
Multi-model serving runs multiple models on shared GPU infrastructure, improving utilization by packing models that have complementary traffic patterns onto the same GPU fleet. The key challenge is GPU memory management: each model consumes a fixed amount of GPU memory when loaded, and loading/unloading models is slow (seconds to minutes). Effective multi-model serving requires a model loading policy that keeps frequently-accessed models in GPU memory and evicts rarely-used models to CPU memory or disk.
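One way to implement such a loading policy is LRU eviction under a GPU memory budget. A minimal sketch; `load_fn` and `unload_fn` are hypothetical callbacks standing in for your serving runtime's actual load/unload hooks:

```python
from collections import OrderedDict

class ModelCache:
    """Keeps models resident in GPU memory under a fixed budget,
    evicting the least-recently-used model when space runs out."""

    def __init__(self, budget_gb: float, load_fn, unload_fn):
        self.budget_gb = budget_gb
        self.load_fn = load_fn      # moves a model into GPU memory
        self.unload_fn = unload_fn  # moves a model to CPU memory / disk
        self.resident = OrderedDict()  # model name -> size in GB, LRU order

    def acquire(self, name: str, size_gb: float) -> None:
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as recently used
            return
        # Evict least-recently-used models until the new one fits.
        while self.resident and sum(self.resident.values()) + size_gb > self.budget_gb:
            victim, _ = self.resident.popitem(last=False)
            self.unload_fn(victim)
        self.load_fn(name)
        self.resident[name] = size_gb
```

Because loads are slow, a production version would also pin models with strict latency SLAs so they can never be evicted, and prefetch models whose traffic is predictable.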
Pattern 4: Multi-Region Deployment
Multi-region deployment serves model inference from multiple geographic locations for three reasons: latency reduction (serve users from the nearest region), availability (survive a regional outage), and data residency (keep data processing within regulatory boundaries). The cost of multi-region is significant because you need GPU capacity in each region. The trade-off is justified for latency-sensitive applications with a global user base or for applications with strict data residency requirements.
GPU spot instance availability varies dramatically by region and instance type. Build your capacity planning around the worst case for each region, and maintain on-demand fallback capacity for critical workloads. Running a multi-region AI deployment entirely on spot instances is a recipe for correlated availability failures.
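That worst-case planning can be made concrete: cover the critical baseline with on-demand capacity, and oversubscribe spot requests against the worst fulfillment rate you have observed in that region. A rough sketch; the fractions and availability figures below are illustrative assumptions, not benchmarks:

```python
import math

def capacity_mix(peak_gpus: int,
                 critical_fraction: float,
                 worst_spot_availability: float) -> dict:
    """Per-region GPU plan: on-demand covers the critical baseline,
    spot covers the remainder, oversubscribed for reclamation risk."""
    on_demand = math.ceil(peak_gpus * critical_fraction)
    spot_needed = peak_gpus - on_demand
    spot_requested = (math.ceil(spot_needed / worst_spot_availability)
                      if spot_needed > 0 else 0)
    return {"on_demand": on_demand, "spot_requested": spot_requested}

# 40-GPU peak, half of it critical, worst-case 80% spot fulfillment
print(capacity_mix(40, 0.5, 0.8))  # on_demand: 20, spot_requested: 25
```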
Capacity Planning
AI capacity planning must account for GPU memory constraints (each model requires a fixed amount of GPU memory regardless of load), cold start times (loading a model into GPU memory can take 30 seconds to several minutes), and the non-linear relationship between replica count and latency. Adding replicas reduces average latency but does not reduce worst-case latency caused by individual slow inferences. Plan for headroom: target 60-70% GPU utilization to absorb traffic spikes without latency degradation.
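The headroom guidance translates directly into a replica-count estimate. A back-of-the-envelope sketch, assuming you know peak request rate and a single replica's saturated throughput; the 65% target sits in the 60-70% utilization band above:

```python
import math

def replicas_needed(peak_qps: float,
                    per_replica_qps: float,
                    target_utilization: float = 0.65) -> int:
    """Replicas required to serve peak traffic while keeping each
    replica at the target utilization, leaving headroom for spikes."""
    return math.ceil(peak_qps / (per_replica_qps * target_utilization))

# 120 QPS peak, 10 QPS per saturated replica, 65% target -> 19 replicas
print(replicas_needed(120, 10))
```

Remember to add cold start time on top of this: if loading a model takes two minutes, the autoscaler must begin scaling well before the spike arrives, or the headroom must absorb the entire ramp.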
Version History
1.0.0 · 2026-03-01
- Initial release with four scaling patterns: horizontal, queue-based, multi-model, and multi-region
- Kubernetes HPA configuration example for GPU-based auto-scaling
- Multi-region architecture diagram with geo-routing and failover
- Capacity planning guidance for GPU memory and cold start constraints
- Scaling readiness checklist