Dynamic Fault Tolerance and Performance Optimization in Grid Computing Using Unified Checkpointing and Replica Management
Abstract
This paper presents a novel dynamic fault tolerance mechanism for grid computing that combines the Unified Checkpointing technique with task replication and replica management to improve performance and reliability in distributed, heterogeneous environments. The proposed method addresses challenges inherent in opportunistic grid environments, such as machine failures, network partitions, and fluctuating resource availability. By dynamically adjusting the number of replicas, monitoring resource status, and recovering from the checkpoint of the most advanced replica, the system minimizes downtime and reduces task execution time. Simulations using the GridSim toolkit show a reduction in task execution time of up to 47% compared with traditional approaches. The results highlight the potential of this approach for improving the performance of long-running tasks, especially in unpredictable computing environments such as student laboratories or other resource-constrained settings. Ongoing work focuses on adaptive feedback mechanisms that further optimize replica management and checkpointing strategies based on environmental factors.
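The recovery idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the GridSim API. The `Replica` class, the function names, and the replica-count policy (one extra replica per 20 percentage points of observed failure rate, capped at five) are all assumptions made for illustration. The sketch also assumes checkpoints are stored off-node, so the checkpoint of a failed replica remains usable.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    """One copy of a task on a grid node (hypothetical model, not a GridSim class)."""
    node: str
    progress: float = 0.0  # fraction of work covered by the last checkpoint (0.0 to 1.0)
    alive: bool = True


def most_advanced_checkpoint(replicas: List[Replica]) -> float:
    """Return the furthest checkpointed progress among all replicas.

    Assumption: checkpoints live in shared storage, so a failed
    replica's checkpoint can still be read.
    """
    return max((r.progress for r in replicas), default=0.0)


def target_replica_count(failure_rate: float,
                         min_replicas: int = 1,
                         max_replicas: int = 5) -> int:
    """Illustrative policy only: add replicas as the observed failure rate grows."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    extra = int(failure_rate * 5)  # one extra replica per 0.2 of failure rate
    return max(min_replicas, min(max_replicas, min_replicas + extra))


def rebalance(replicas: List[Replica],
              idle_nodes: List[str],
              failure_rate: float) -> List[Replica]:
    """Drop failed replicas and start replacements from the most advanced checkpoint."""
    resume_from = most_advanced_checkpoint(replicas)
    survivors = [r for r in replicas if r.alive]
    wanted = target_replica_count(failure_rate)
    free = list(idle_nodes)  # copy so the caller's list is not modified
    while len(survivors) < wanted and free:
        survivors.append(Replica(node=free.pop(0), progress=resume_from))
    return survivors
```

For example, with replicas on nodes `a` (20% checkpointed), `b` (70%, failed), and `c` (40%), calling `rebalance(replicas, ["d", "e"], failure_rate=0.5)` keeps `a` and `c` and starts a new replica on `d` that resumes from the 70% checkpoint instead of restarting from zero.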