Self-Healing Infrastructure: Leveraging Reinforcement Learning for Autonomous Cloud Recovery and Enhanced Resilience
Abstract
Maintaining high availability and reliability in dynamic cloud environments demands proactive, automated solutions capable of handling failures at scale. This study introduces a novel multi-layer self-healing infrastructure framework that unites predictive analytics, reinforcement learning (RL), and rule-based automation into a cohesive, horizontally scalable system. The predictive analytics layer continuously ingests telemetry (CPU, memory, and network metrics, plus application logs) and applies time-series forecasting (ARIMA, LSTM) and unsupervised anomaly detection (Isolation Forest, k-means) to flag potential faults with more than 96% accuracy. An RL agent trained with Proximal Policy Optimization (PPO) then dynamically selects recovery actions (e.g., container restart, horizontal scaling, resource reallocation), guided by a reward function that rewards reductions in Mean Time To Repair (MTTR) while penalizing resource overhead and service impact. In parallel, rule-based playbooks address frequent failure patterns, remediating predictable incidents within 30 seconds. The framework was deployed as Infrastructure as Code (IaC) via Terraform and Helm on Kubernetes clusters across AWS and Azure and validated across 220 fault scenarios. Key performance indicators show an 85% MTTR reduction (from 90 to 13.5 minutes), recovery reliability exceeding 95%, fault tolerance above 91%, and system uptime surpassing 98%, with resource overhead during recovery remaining under 10%. Compared with prior isolated approaches (rule-based automation with a 60% MTTR reduction, RL-only recovery with a 50% MTTR reduction, and anomaly detection without remediation), the integrated model delivers superior resilience and operational efficiency.
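To make the reward trade-off described above concrete, the following is a minimal sketch of how such a reward might be computed. It is not the authors' implementation: the `RecoveryOutcome` fields, the `recovery_reward` function, the weights, and the failure penalty are all hypothetical, and only the 90-minute baseline and the 13.5-minute figure are taken from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryOutcome:
    """Observed result of one recovery action (hypothetical fields)."""
    repair_minutes: float     # time taken to restore service
    resource_overhead: float  # extra resources used, as a fraction (0.10 = 10%)
    service_impact: float     # fraction of requests degraded during recovery
    recovered: bool           # whether the action restored the service


def recovery_reward(outcome, baseline_minutes=90.0,
                    w_mttr=1.0, w_overhead=0.5, w_impact=0.5,
                    failure_penalty=1.0):
    """Reward MTTR reduction; penalize overhead and service impact.

    The weights and the penalty are illustrative assumptions,
    not values reported in the paper.
    """
    if not outcome.recovered:
        return -failure_penalty
    # Fractional MTTR reduction relative to the baseline repair time.
    mttr_reduction = (baseline_minutes - outcome.repair_minutes) / baseline_minutes
    return (w_mttr * mttr_reduction
            - w_overhead * outcome.resource_overhead
            - w_impact * outcome.service_impact)


# A fast but resource-hungry recovery versus a slow, cheap one.
fast = RecoveryOutcome(repair_minutes=13.5, resource_overhead=0.10,
                       service_impact=0.02, recovered=True)
slow = RecoveryOutcome(repair_minutes=60.0, resource_overhead=0.02,
                       service_impact=0.02, recovered=True)
print(recovery_reward(fast) > recovery_reward(slow))  # prints True
```

Under these assumed weights, the 13.5-minute recovery corresponds to the reported 85% MTTR reduction ((90 - 13.5) / 90 = 0.85) and scores higher than the slower recovery despite its larger overhead. Different weights would shift that balance.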
This paper is organized as follows: Section 1 introduces the problem and contributions; Section 2 reviews related work; Section 3 details the methodology; Section 4 describes the experimental setup; Section 5 presents the results; Section 6 discusses implications and limitations; Section 7 outlines future research directions; and Section 8 concludes.