RegBoost: A Novel Hybrid Methodology for Enhanced Spatiotemporal Forecasting of PM2.5 in Gurugram (2019-2023) Using Multi-Source Ground-Based, Meteorological, and Sentinel-5P Data
Abstract
Introduction: This research addresses particulate matter (PM2.5) pollution dynamics in Gurugram, an integral part of the National Capital Region (NCR) of India. The city has been notorious for possibly some of the worst pollution levels of the past decade, which makes it a priority area for research and intervention. The study focuses on Gurugram air quality data from 2019 to 2023.
Objectives: This study develops a novel hybrid methodology, referred to as ‘RegBoost’, for enhanced spatiotemporal forecasting of PM2.5 and evaluates it against recent research in this domain. Existing models face challenges related to data accuracy, missing values, data scarcity and unreliable data sources, which limit their predictive performance.
Methods: To improve accuracy, we take a multi-faceted approach that combines data from the Central Pollution Control Board (CPCB), the National Aeronautics and Space Administration (NASA) and Sentinel-5 Precursor (Sentinel-5P). The novel frameworks employed in this study include a hybrid imputation technique that fuses Matrix Factorisation and K-Nearest Neighbors (Hybrid MF+KNN) and a hybrid predictive model that fuses Ridge Regression and XGBoost (RR+XGBoost). RegBoost combines the hybrid imputation and preprocessing steps with the RR+XGBoost model, applied to the CPCB, NASA and Sentinel-5P data. Its performance is benchmarked against current literature and previous studies to assess its comparative efficacy. This research further provides the most comprehensive and up-to-date analysis of Gurugram, which remains a major problem area owing to rising pollution levels and limited solutions. It seeks to fill a gap in spatiotemporal forecasting, since most existing models are limited to temporal forecasting or to spatial forecasting for a single location. The methodology is specifically designed to address the challenges of spatiotemporal air quality forecasting, including data inconsistencies and the complex interplay of various pollution factors.
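As an illustration of the hybrid imputation idea, the following minimal Python sketch blends a low-rank (matrix factorisation style) fill with a K-nearest-neighbours fill. The abstract does not state how the two imputers are fused, so the weighted average used here is an assumption, as are the function names (`mf_impute`, `knn_impute`, `hybrid_impute`), the parameters (`rank`, `k`, `alpha`) and the synthetic hours-by-stations matrix. It should be read as a sketch under those assumptions, not as the authors' implementation.

```python
import numpy as np

def mf_impute(X, rank=2, n_iter=50):
    """Low-rank fill: iteratively replace missing cells with a truncated-SVD reconstruction."""
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled = np.where(mask, low_rank, X)  # observed cells are never overwritten
    return filled

def knn_impute(X, k=3):
    """Fill each missing cell with the mean of the k most similar rows that observe it."""
    mask = np.isnan(X)
    filled = X.copy()
    for i in np.where(mask.any(axis=1))[0]:
        dists = []
        for j in range(X.shape[0]):
            shared = ~mask[i] & ~mask[j]  # columns observed in both rows
            if j != i and shared.any():
                d = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
                dists.append((d, j))
        dists.sort()
        for c in np.where(mask[i])[0]:
            donors = [X[j, c] for _, j in dists if not mask[j, c]][:k]
            if donors:
                filled[i, c] = np.mean(donors)
    return filled

def hybrid_impute(X, rank=2, k=3, alpha=0.5):
    """Assumed fusion rule: weighted average of the two fills (alpha weights the MF fill)."""
    mf = mf_impute(X, rank=rank)
    knn = knn_impute(X, k=k)
    knn = np.where(np.isnan(knn), mf, knn)  # fall back to MF where KNN found no donors
    blend = alpha * mf + (1 - alpha) * knn
    return np.where(np.isnan(X), blend, X)

# Synthetic example (not real CPCB data): 24 hourly readings at 4 hypothetical stations.
rng = np.random.default_rng(0)
hours = np.arange(24)
truth = (80 + 30 * np.sin(hours / 24 * 2 * np.pi)[:, None] * np.array([1.0, 0.8, 1.2, 0.9])
         + rng.normal(0, 2, (24, 4)))
observed = truth.copy()
observed[::5, 1] = np.nan   # simulate sensor dropouts at station 1
observed[3::7, 2] = np.nan  # and at station 2
completed = hybrid_impute(observed)
```

In this sketch only the missing cells are changed; observed readings pass through untouched, so the imputed matrix can be handed to any downstream forecasting model.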
Results: Incorporating the novel hybrid methods improved the accuracy of air quality forecasting. The hybrid imputation method substantially reduced the data gaps that had previously affected predictive performance. The RR+XGBoost model demonstrated superior performance in capturing complex patterns and relationships within the spatiotemporal data, leading to more precise and reliable PM2.5 predictions. The models also performed consistently across seasonal variations and during stubble-burning periods, indicating their robustness.
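One way to see why pairing a linear model with boosted trees can capture patterns that either misses alone is residual stacking: a ridge regression fits the linear trend, and a boosted ensemble is then fitted to what the ridge model leaves unexplained. The abstract does not specify how Ridge Regression and XGBoost are fused, so this residual scheme is an assumption. To keep the sketch dependent on NumPy only, a small gradient-boosted ensemble of decision stumps stands in for XGBoost; all function names, hyperparameters and the synthetic data below are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w

def fit_stump(X, r):
    """Best single-split regression stump for residuals r under squared error."""
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:  # drop the max so the right side is never empty
            left = X[:, f] <= t
            lv, rv = r[left].mean(), r[~left].mean()
            err = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, f, t, lv, rv)
    return best[1:]

def predict_stump(stump, X):
    f, t, lv, rv = stump
    return np.where(X[:, f] <= t, lv, rv)

def fit_hybrid(X, y, lam=1.0, n_rounds=50, lr=0.1):
    """Ridge first, then boosted stumps on the ridge residuals (stand-in for XGBoost)."""
    w, b = fit_ridge(X, y, lam)
    resid = y - (X @ w + b)
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(X, resid)
        stumps.append(stump)
        resid = resid - lr * predict_stump(stump, X)
    return w, b, stumps, lr

def predict_hybrid(model, X):
    w, b, stumps, lr = model
    return X @ w + b + lr * sum(predict_stump(s, X) for s in stumps)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic data (not real PM2.5 measurements): a linear term plus a nonlinear term.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (200, 3))
y = 50 + 20 * X[:, 0] + 15 * np.sin(6 * X[:, 1]) + rng.normal(0, 1, 200)

w, b = fit_ridge(X, y)
ridge_rmse = rmse(y, X @ w + b)
hybrid_rmse = rmse(y, predict_hybrid(fit_hybrid(X, y), X))
# Training-set comparison only; a real evaluation would need held-out data.
print(f"ridge-only RMSE: {ridge_rmse:.2f}, hybrid RMSE: {hybrid_rmse:.2f}")
```

The comparison here is on training data, where the boosting stage can only lower the error; it illustrates the mechanism and says nothing about the out-of-sample gains reported in the study.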
Conclusions: The RegBoost methodology offers a robust and effective solution for enhanced spatiotemporal PM2.5 forecasting in urban environments. Its ability to integrate diverse data sources and handle missing values effectively positions it as a valuable tool for environmental monitoring, policy-making, and public health initiatives. The insights gained from this study contribute significantly to the understanding of air quality dynamics in Gurugram and provide a scalable framework for similar pollution challenges globally. Future work will focus on integrating real-time data streams and exploring the applicability of RegBoost in other geographic regions with varying pollution characteristics.