Scaling ML Inference Without Overengineering
A pragmatic architecture for stable latency and predictable cost
System Design · Inference · Scalability
Problem
Inference traffic was spiky, and p95 latency degraded sharply during peak usage.
Constraints
- Limited platform team bandwidth
- Cost ceilings per request
- Need for rapid rollback when model regressions appear
Architecture
- Async queue for burst absorption (sketch below)
- Autoscaled inference workers with warm pools (sketch below)
- Layered caching for repeated queries (sketch below)
- Canary deployment for safer model rollouts (sketch below)
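A minimal sketch of the burst-absorption queue using Python's asyncio. The queue depth, worker count, and the run_inference stand-in are illustrative assumptions rather than the production values; the point is that spikes pile up in a bounded queue instead of spawning unbounded concurrent model calls.

```python
import asyncio
import random

# Hypothetical stand-in for the model forward pass; latency is simulated.
async def run_inference(payload: dict) -> dict:
    await asyncio.sleep(random.uniform(0.05, 0.2))
    return {"request_id": payload["request_id"], "score": random.random()}

async def worker(queue: asyncio.Queue) -> None:
    # Workers drain the shared queue; bursts wait in the queue rather than
    # overwhelming the model processes.
    while True:
        payload, done = await queue.get()
        try:
            done.set_result(await run_inference(payload))
        finally:
            queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1_000)  # bounded: back-pressure on extreme bursts
    workers = [asyncio.create_task(worker(queue)) for _ in range(8)]

    # Simulate a burst arriving faster than the workers can serve it.
    loop = asyncio.get_running_loop()
    futures = []
    for i in range(50):
        fut = loop.create_future()
        await queue.put(({"request_id": i}, fut))
        futures.append(fut)

    results = await asyncio.gather(*futures)
    print(f"served {len(results)} requests")
    for w in workers:
        w.cancel()

if __name__ == "__main__":
    asyncio.run(main())
```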
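The warm pool, sketched as a process pool that loads the model once per worker at startup. The pool size, load time, and model stub are assumptions; in production an autoscaler would grow or shrink this pool while keeping a minimum number of warm workers.

```python
import time
from concurrent.futures import ProcessPoolExecutor

_MODEL = None  # one loaded model per worker process

def _load_model():
    time.sleep(2.0)  # stand-in for an expensive model load (weights, GPU init, ...)
    return lambda payload: {"request_id": payload["request_id"], "score": 0.5}

def _init_worker() -> None:
    # Pay the cold-start cost once, at worker startup, not per request.
    global _MODEL
    _MODEL = _load_model()

def infer(payload: dict) -> dict:
    return _MODEL(payload)

if __name__ == "__main__":
    # Eight pre-warmed workers; an autoscaler would adjust this count while
    # keeping a minimum idle pool to absorb sudden bursts.
    with ProcessPoolExecutor(max_workers=8, initializer=_init_worker) as pool:
        results = list(pool.map(infer, [{"request_id": i} for i in range(32)]))
        print(f"served {len(results)} requests on warm workers")
```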
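The layered cache, sketched as a small in-process LRU in front of a shared store. The shared store here is a plain dict standing in for a networked cache such as Redis; the key scheme and capacities are illustrative assumptions.

```python
import hashlib
import json
from collections import OrderedDict
from typing import Any, Callable, Optional

def cache_key(payload: dict) -> str:
    # Deterministic key over the request payload.
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

class LRUCache:
    """Tiny in-process LRU (layer 1)."""

    def __init__(self, capacity: int = 1024) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

local_cache = LRUCache(capacity=1024)
shared_store: dict = {}  # layer 2: stand-in for a networked cache (e.g. Redis)

def cached_infer(payload: dict, model_fn: Callable[[dict], Any]) -> Any:
    key = cache_key(payload)
    hit = local_cache.get(key)          # layer 1: in-process, cheapest
    if hit is not None:
        return hit
    if key in shared_store:             # layer 2: shared across workers
        value = shared_store[key]
        local_cache.put(key, value)
        return value
    result = model_fn(payload)          # miss on both layers: run the model
    shared_store[key] = result
    local_cache.put(key, result)
    return result

if __name__ == "__main__":
    model = lambda p: {"score": 0.9}                  # hypothetical model call
    print(cached_infer({"text": "hello"}, model))     # miss: runs the model
    print(cached_infer({"text": "hello"}, model))     # hit: served from layer 1
```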
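Canary routing, sketched as weighted dispatch between the stable and candidate models. The 5% traffic share, error threshold, and fallback-on-error behaviour are assumptions; the idea is that a regression shows up on a small slice of traffic, and rollback means setting the weight back to zero.

```python
import random
from typing import Callable

CANARY_FRACTION = 0.05    # share of traffic sent to the candidate model (assumed)
ERROR_THRESHOLD = 0.02    # maximum tolerated canary error rate (assumed)

canary_requests = 0
canary_errors = 0

def route(payload: dict,
          stable_model: Callable[[dict], dict],
          canary_model: Callable[[dict], dict]) -> dict:
    """Send a small, random slice of traffic to the canary model."""
    global canary_requests, canary_errors
    if random.random() < CANARY_FRACTION:
        canary_requests += 1
        try:
            return canary_model(payload)
        except Exception:
            canary_errors += 1
            # Fail closed: fall back to the stable model if the canary errors.
            return stable_model(payload)
    return stable_model(payload)

def canary_healthy(min_requests: int = 200) -> bool:
    # Only judge the canary once it has seen enough traffic; promote or
    # roll back based on its observed error rate.
    if canary_requests < min_requests:
        return True
    return (canary_errors / canary_requests) <= ERROR_THRESHOLD
```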
Trade-offs
- Warm pools improve latency but increase idle cost
- Aggressive caching reduces cost but risks stale outputs (see the TTL sketch after this list)
- Canary strategy slows rollout but protects quality
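One common way to bound the staleness risk from aggressive caching is to expire entries after a TTL. This is a generic mitigation sketch, not necessarily what was shipped here, and the 60-second TTL is an assumption.

```python
import time
from typing import Any, Optional, Dict, Tuple

class TTLCache:
    """Entries expire after ttl_seconds, bounding how stale a cached output can be."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self.ttl = ttl_seconds
        self._data: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._data[key]      # expired: force a fresh model call
            return None
        return value

    def put(self, key: str, value: Any) -> None:
        self._data[key] = (time.monotonic(), value)
```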
Results
- p95 latency improved from 920ms to 510ms
- Compute cost per 1k requests dropped by 23%
- Rollback time reduced to under 5 minutes