Dynamic Algorithm Selection for Replication using RSYNC


Kamal Shah, Sunny Sall, Abhijeet Panpatil

Abstract

Introduction: In big data and distributed systems, efficient data replication is essential for maintaining data availability, consistency, and performance. Rsync, a widely used file synchronization tool, can compress data to reduce the volume transferred. However, its performance depends heavily on the compression algorithm chosen, the file type, and network conditions. This research introduces a novel approach to dynamic algorithm selection for Rsync that uses machine learning to choose the best-performing compression algorithm (gzip, zstd, or lz4) automatically, based on file characteristics and network parameters.


Objectives: This study aims to enhance Rsync’s efficiency in three ways: (1) developing a machine learning model for dynamic algorithm selection; (2) evaluating the impact of compression choices on synchronization time and bandwidth usage; and (3) implementing an adaptive system that optimizes Rsync’s performance based on file characteristics and network conditions.


Methods: The methodology follows a six-phase approach. First, an experimental environment is set up. Second, a dataset consisting of text, binary, and backup files is prepared. Third, a shell script is developed to automate compression and synchronization. Fourth, performance metrics such as compression time, Rsync time, and bandwidth usage are collected. Fifth, machine learning models (Random Forest, Decision Tree, and Linear Regression) are trained to predict the optimal compression algorithm. Finally, the model is validated through extensive testing.
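
The metric-collection phase, and the labelling of each sample with its best algorithm, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: it is written in Python rather than shell, it uses two gzip levels from the standard library as stand-ins for the gzip, zstd, and lz4 candidates (zstd and lz4 require third-party packages), and the names `CODECS`, `measure`, and `best_codec`, as well as the fixed-bandwidth cost model, are hypothetical.

```python
import gzip
import time
from typing import Callable, Dict

# Hypothetical codec registry. The gzip levels stand in for the study's
# gzip/zstd/lz4 candidates, which are not all in the standard library.
CODECS: Dict[str, Callable[[bytes], bytes]] = {
    "none": lambda data: data,
    "gzip-1": lambda data: gzip.compress(data, compresslevel=1),
    "gzip-9": lambda data: gzip.compress(data, compresslevel=9),
}


def measure(data: bytes, compress: Callable[[bytes], bytes]) -> dict:
    """Return compression time in seconds and output size in bytes."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    return {"compress_s": elapsed, "size": len(out)}


def best_codec(data: bytes, bandwidth_bytes_per_s: float) -> str:
    """Label a sample with the codec that minimizes estimated total time.

    Assumed cost model: compression time + compressed size / bandwidth.
    """
    def cost(name: str) -> float:
        m = measure(data, CODECS[name])
        return m["compress_s"] + m["size"] / bandwidth_bytes_per_s

    return min(CODECS, key=cost)
```

Labels produced this way, paired with features such as file type, file size, and link bandwidth, would form the kind of training set the fifth phase describes.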


Results: The machine learning models predicted the best algorithm with high accuracy. Random Forest performed best, with Train R² = 98.45%, Test R² = 97.82%, training MSE = 0.45, and test MSE = 0.72. Decision Tree fit the training data most closely (Train R² = 99.50%) but generalized less well (Test R² = 94.10%), a gap of 5.40 percentage points that suggests some overfitting. Linear Regression provided a solid baseline with Train R² = 94.85% and Test R² = 93.72%. Dynamic algorithm selection significantly reduced synchronization time and bandwidth consumption across different scenarios.
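
For readers unfamiliar with the two reported metrics, the sketch below computes them from their standard definitions: MSE is the mean of the squared prediction errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The function names and the toy values are illustrative assumptions and are not taken from the study; the abstract reports R² as a percentage, so 0.8 here would correspond to 80%.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, not data from the study.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(mse(y_true, y_pred))  # 0.25
print(r2(y_true, y_pred))   # 0.8
```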


Conclusions: This research presents an intelligent, adaptive system that enhances Rsync’s efficiency for data replication. By integrating machine learning, Rsync can dynamically select the optimal compression algorithm, improving performance in real-world applications. This approach contributes to the field of data replication by offering a scalable and automated solution for optimizing file synchronization.
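
One way a predicted algorithm could be applied at synchronization time is sketched below. It assumes rsync 3.2.0 or later, where the `--compress-choice` option is available, and a build with zstd and lz4 support. The abstract does not say how the authors' system invokes Rsync (the shell script may compress files before transfer instead), so the function `build_rsync_command`, the `RSYNC_CHOICE` mapping, and the example paths are all hypothetical.

```python
from typing import List

# Assumed mapping from the model's label to an rsync --compress-choice value.
# rsync has no "gzip" choice; zlib (also DEFLATE-based) is the closest match.
RSYNC_CHOICE = {"gzip": "zlib", "zstd": "zstd", "lz4": "lz4"}


def build_rsync_command(algorithm: str, src: str, dest: str) -> List[str]:
    """Build an rsync argument list using the selected compression algorithm."""
    choice = RSYNC_CHOICE[algorithm]  # raises KeyError for unknown labels
    return [
        "rsync",
        "-a",                           # archive mode
        "--compress",                   # enable compression in transit
        f"--compress-choice={choice}",  # algorithm picked by the model
        src,
        dest,
    ]


# Hypothetical paths for illustration only.
cmd = build_rsync_command("zstd", "data/", "backup-host:/srv/data/")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Returning a list rather than a shell string avoids quoting problems with unusual file names.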
