Self-Healing Infrastructure: Leveraging Reinforcement Learning for Autonomous Cloud Recovery and Enhanced Resilience

Rohit Laheri

doi:10.52783/jisem.v10i49s.9888

Published: May 22, 2025

Keywords:

self-healing, cloud resilience, reinforcement learning, predictive analytics, rule-based automation, infrastructure as code, Kubernetes, cloud recovery

Rohit Laheri, Harish Kumar Krishnamurthy Sukumar, Chandrashekar Kola, Yashasvi Makin

Abstract

Maintaining high availability and reliability in dynamic cloud environments demands proactive, automated solutions capable of handling failures at scale. This study introduces a novel multi-layer self-healing infrastructure framework that combines predictive analytics, reinforcement learning (RL), and rule-based automation in a single, horizontally scalable system. The predictive analytics layer continuously ingests telemetry (CPU, memory, and network metrics, plus application logs) and applies time-series forecasting (ARIMA, LSTM) and unsupervised anomaly detection (Isolation Forest, k-Means) to flag potential faults with >96% accuracy. An RL agent using Proximal Policy Optimization (PPO) then dynamically selects recovery actions (e.g., container restart, horizontal scaling, resource reallocation), guided by a reward function that weighs reduction in Mean Time To Repair (MTTR) against resource overhead and service impact. In parallel, rule-based playbooks handle frequent, predictable failure patterns and remediate them within 30 seconds. Deployed as Infrastructure as Code (IaC) via Terraform and Helm on Kubernetes clusters across AWS and Azure, the framework was validated over 220 fault scenarios. Key performance indicators show an 85% MTTR reduction (from 90 to 13.5 minutes), recovery reliability exceeding 95%, fault tolerance above 91%, and system uptime surpassing 98%. Resource overhead during recovery remains under 10%. Compared with prior isolated methods (rule-based automation with a 60% MTTR reduction, RL-only approaches with a 50% MTTR reduction, and anomaly detection without remediation), the integrated model delivers superior resilience and operational efficiency.
This paper is organized as follows: Section 1 introduces the problem and contributions; Section 2 reviews related work; Section 3 details the methodology; Section 4 describes experimental setup; Section 5 presents results; Section 6 discusses implications and limitations; Section 7 outlines future research directions; and Section 8 concludes.
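
The abstract describes a reward function that weighs MTTR reduction against resource overhead and service impact, but does not give its form. The following is a minimal sketch of one possible shape, not the authors' formulation: the linear structure, the weights (`w_mttr`, `w_overhead`, `w_impact`), the `RecoveryOutcome` fields, and the example outcomes are all hypothetical. Only the 90-minute baseline and the 13.5-minute result come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryOutcome:
    """Hypothetical summary of one recovery episode."""
    repair_minutes: float     # time taken to restore service
    resource_overhead: float  # extra resources used, as a fraction (0.10 = 10%)
    service_impact: float     # fraction of requests degraded during recovery


def mttr_reduction(before_minutes: float, after_minutes: float) -> float:
    """Relative MTTR reduction, e.g. 0.85 for an 85% reduction."""
    return (before_minutes - after_minutes) / before_minutes


def reward(outcome: RecoveryOutcome,
           baseline_minutes: float = 90.0,  # baseline MTTR reported in the abstract
           w_mttr: float = 1.0,             # assumed weight, not from the paper
           w_overhead: float = 0.5,         # assumed weight, not from the paper
           w_impact: float = 0.5) -> float: # assumed weight, not from the paper
    """Linear trade-off: reward faster repair, penalize overhead and impact."""
    gain = mttr_reduction(baseline_minutes, outcome.repair_minutes)
    return (w_mttr * gain
            - w_overhead * outcome.resource_overhead
            - w_impact * outcome.service_impact)


# Illustrative outcomes (made-up values apart from the 13.5-minute repair time).
fast = RecoveryOutcome(repair_minutes=13.5, resource_overhead=0.10, service_impact=0.02)
slow = RecoveryOutcome(repair_minutes=60.0, resource_overhead=0.05, service_impact=0.02)

print(f"MTTR reduction: {mttr_reduction(90.0, 13.5):.0%}")
print(f"reward(fast) = {reward(fast):.3f}, reward(slow) = {reward(slow):.3f}")
```

Under these assumed weights, the faster recovery scores higher despite its larger resource overhead, which is the kind of trade-off the abstract attributes to the RL agent. The same helper also confirms that going from 90 to 13.5 minutes corresponds to the reported 85% reduction.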

Issue: Vol. 10 No. 49s (2025)

Section: Articles

Journal of Information Systems Engineering and Management
