SLO-First Autoscaling for Multi-Tenant Microservices: A Control-Theoretic Approach to P95/P99 Latency
Abstract
Maintaining strict service-level objectives (SLOs) at the tail of latency distributions (P95/P99) remains challenging in large-scale, multi-tenant cloud platforms. Conventional autoscalers scale workloads on resource-utilization metrics such as CPU or memory, so they react only after latency violations occur and cannot anticipate bursty demand or cross-tenant interference. This work presents a control-theoretic resource-management framework that proactively enforces latency SLOs across microservices by modeling each service as a queueing system. The framework continuously infers service-time distributions and backlog states to predict future tail latency under varying load, then applies model predictive control to allocate resources before violations manifest. The design incorporates multi-tenant fairness mechanisms that isolate noisy tenants while preserving global cost efficiency through constrained optimization over the prediction horizon. Evaluation on synthetic burst traces and production-like workloads demonstrates a 73% reduction in P95 violations, a 68% reduction in P99 violations, 2.3× faster scaling convergence than reactive methods, and 41% resource cost savings relative to static over-provisioning. The framework establishes a principled foundation for latency-driven, cost-aware resource management in multi-tenant cloud environments.
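To make the core idea concrete before the full treatment, the following is a minimal sketch of a queueing-model-based, SLO-first scaling plan. It is not the paper's implementation: it assumes a deliberately simple M/M/1-per-replica approximation (each replica serves an equal share of arrivals with exponential service times), under which the P95 response time has a closed form, and it searches over replica counts at each step of a hypothetical load forecast. The function names, parameters, and workload numbers are illustrative assumptions.

```python
import math

def p95_latency(arrival_rate, service_rate, replicas):
    """Predicted P95 response time under an M/M/1-per-replica
    approximation: each replica sees arrival_rate / replicas."""
    per_replica = arrival_rate / replicas
    if per_replica >= service_rate:
        return math.inf  # unstable: the queue grows without bound
    # For M/M/1, response time is exponential with rate (mu - lambda),
    # so the p-quantile is -ln(1 - p) / (mu - lambda).
    return -math.log(1 - 0.95) / (service_rate - per_replica)

def plan_replicas(forecast, service_rate, slo, max_replicas=64):
    """For each step of a load forecast (req/s), choose the fewest
    replicas whose predicted P95 stays within the latency SLO."""
    plan = []
    for lam in forecast:
        for c in range(1, max_replicas + 1):
            if p95_latency(lam, service_rate, c) <= slo:
                plan.append(c)
                break
        else:
            plan.append(max_replicas)  # capped; SLO may be violated
    return plan

# Illustrative numbers: 50 req/s per replica, 200 ms P95 SLO,
# and a bursty three-step load forecast (requests/sec).
print(plan_replicas([80.0, 300.0, 120.0], service_rate=50.0, slo=0.2))
# → [3, 9, 4]: capacity is raised ahead of the burst, then released.
```

The contrast with a utilization-based autoscaler is the point of the sketch: the decision variable is predicted tail latency over a forecast horizon, not observed CPU, so scaling happens before the burst arrives rather than after queues have already formed. The full framework replaces the closed-form quantile with inferred service-time distributions and backlog states, and the per-step greedy search with a constrained MPC optimization.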