Event-Driven Self-Healing Infrastructure: A Conceptual Framework for Intelligent Automation in Site Reliability Engineering
Abstract
As organizations adopt microservice and cloud-native architectures, resilience and autonomous operations have become increasingly important. This paper presents an event-driven self-healing infrastructure concept for Site Reliability Engineering (SRE). The framework enables proactive detection and resolution of incidents without human involvement by combining real-time observability pipelines, serverless automation, and AI-driven decision engines. The evaluation indicates that mean time to repair and operational cost drop significantly, while recovery accuracy improves. The paper also examines the remaining challenges, including integration complexity and the need for human oversight in edge cases, and provides a practical roadmap toward intelligent, self-managing systems that improve service reliability.