Scaling ML Inference Without Overengineering
A pragmatic architecture for stable latency and predictable cost
System Design · Inference · Scalability
Problem
Inference traffic was spiky, and p95 latency degraded sharply during peak usage.
Constraints
- Limited platform team bandwidth
- Cost ceilings per request
- Need for rapid rollback when model regressions appear
Architecture
- Async queue for burst absorption (sketch below)
- Autoscaled inference workers with warm pools (sketch below)
- Layered caching for repeated queries (sketch below)
- Canary deployment for safer model rollouts (sketch below)
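A minimal sketch of the burst-absorption queue using Python's asyncio. The queue depth, worker count, and the run_inference stand-in are illustrative assumptions rather than the production values; the point is that spikes pile up in a bounded queue instead of spawning unbounded concurrent model calls.

```python
import asyncio
import random

# Hypothetical stand-in for the model forward pass; latency is simulated.
async def run_inference(payload: dict) -> dict:
    await asyncio.sleep(random.uniform(0.05, 0.2))
    return {"request_id": payload["request_id"], "score": random.random()}

async def worker(queue: asyncio.Queue) -> None:
    # Workers drain the shared queue; bursts wait in the queue rather than
    # overwhelming the model processes.
    while True:
        payload, done = await queue.get()
        try:
            done.set_result(await run_inference(payload))
        finally:
            queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1_000)  # bounded: back-pressure on extreme bursts
    workers = [asyncio.create_task(worker(queue)) for _ in range(8)]

    # Simulate a burst arriving faster than the workers can serve it.
    loop = asyncio.get_running_loop()
    futures = []
    for i in range(50):
        fut = loop.create_future()
        await queue.put(({"request_id": i}, fut))
        futures.append(fut)

    results = await asyncio.gather(*futures)
    print(f"served {len(results)} requests")
    for w in workers:
        w.cancel()

if __name__ == "__main__":
    asyncio.run(main())
```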
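The warm pool, sketched as a process pool that loads the model once per worker at startup. The pool size, load time, and model stub are assumptions; in production an autoscaler would grow or shrink this pool while keeping a minimum number of warm workers.

```python
import time
from concurrent.futures import ProcessPoolExecutor

_MODEL = None  # one loaded model per worker process

def _load_model():
    time.sleep(2.0)  # stand-in for an expensive model load (weights, GPU init, ...)
    return lambda payload: {"request_id": payload["request_id"], "score": 0.5}

def _init_worker() -> None:
    # Pay the cold-start cost once, at worker startup, not per request.
    global _MODEL
    _MODEL = _load_model()

def infer(payload: dict) -> dict:
    return _MODEL(payload)

if __name__ == "__main__":
    # Eight pre-warmed workers; an autoscaler would adjust this count while
    # keeping a minimum idle pool to absorb sudden bursts.
    with ProcessPoolExecutor(max_workers=8, initializer=_init_worker) as pool:
        results = list(pool.map(infer, [{"request_id": i} for i in range(32)]))
        print(f"served {len(results)} requests on warm workers")
```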
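The layered cache, sketched as a small in-process LRU in front of a shared store. The shared store here is a plain dict standing in for a networked cache such as Redis; the key scheme and capacities are illustrative assumptions.

```python
import hashlib
import json
from collections import OrderedDict
from typing import Any, Callable, Optional

def cache_key(payload: dict) -> str:
    # Deterministic key over the request payload.
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

class LRUCache:
    """Tiny in-process LRU (layer 1)."""

    def __init__(self, capacity: int = 1024) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

local_cache = LRUCache(capacity=1024)
shared_store: dict = {}  # layer 2: stand-in for a networked cache (e.g. Redis)

def cached_infer(payload: dict, model_fn: Callable[[dict], Any]) -> Any:
    key = cache_key(payload)
    hit = local_cache.get(key)          # layer 1: in-process, cheapest
    if hit is not None:
        return hit
    if key in shared_store:             # layer 2: shared across workers
        value = shared_store[key]
        local_cache.put(key, value)
        return value
    result = model_fn(payload)          # miss on both layers: run the model
    shared_store[key] = result
    local_cache.put(key, result)
    return result

if __name__ == "__main__":
    model = lambda p: {"score": 0.9}                  # hypothetical model call
    print(cached_infer({"text": "hello"}, model))     # miss: runs the model
    print(cached_infer({"text": "hello"}, model))     # hit: served from layer 1
```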
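Canary routing, sketched as weighted dispatch between the stable and candidate models. The 5% traffic share, error threshold, and fallback-on-error behaviour are assumptions; the idea is that a regression shows up on a small slice of traffic, and rollback means setting the weight back to zero.

```python
import random
from typing import Callable

CANARY_FRACTION = 0.05    # share of traffic sent to the candidate model (assumed)
ERROR_THRESHOLD = 0.02    # maximum tolerated canary error rate (assumed)

canary_requests = 0
canary_errors = 0

def route(payload: dict,
          stable_model: Callable[[dict], dict],
          canary_model: Callable[[dict], dict]) -> dict:
    """Send a small, random slice of traffic to the canary model."""
    global canary_requests, canary_errors
    if random.random() < CANARY_FRACTION:
        canary_requests += 1
        try:
            return canary_model(payload)
        except Exception:
            canary_errors += 1
            # Fail closed: fall back to the stable model if the canary errors.
            return stable_model(payload)
    return stable_model(payload)

def canary_healthy(min_requests: int = 200) -> bool:
    # Only judge the canary once it has seen enough traffic; promote or
    # roll back based on its observed error rate.
    if canary_requests < min_requests:
        return True
    return (canary_errors / canary_requests) <= ERROR_THRESHOLD
```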
Trade-offs
- Warm pools improve latency but increase idle cost
- Aggressive caching reduces cost but risks stale outputs (see the TTL sketch after this list)
- Canary strategy slows rollout but protects quality
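One common way to bound the staleness risk from aggressive caching is to expire entries after a TTL. This is a generic mitigation sketch, not necessarily what was shipped here, and the 60-second TTL is an assumption.

```python
import time
from typing import Any, Optional, Dict, Tuple

class TTLCache:
    """Entries expire after ttl_seconds, bounding how stale a cached output can be."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self.ttl = ttl_seconds
        self._data: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._data[key]      # expired: force a fresh model call
            return None
        return value

    def put(self, key: str, value: Any) -> None:
        self._data[key] = (time.monotonic(), value)
```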
Results
- p95 latency improved from 920ms to 510ms
- Compute cost per 1k requests dropped by 23%
- Rollback time reduced to under 5 minutes