Consumer Insights at Scale: Integrating ML Pipelines, Scalable Data Engineering, and Distributed Analytics in Multi-Cloud Ecosystems
Abstract
Extracting actionable consumer insights at scale is critical for enterprise competitiveness. This study presents an integrated framework that combines machine learning (ML) pipelines, scalable data engineering, and distributed analytics within multi-cloud ecosystems to generate real-time, predictive consumer intelligence. Using federated learning, the system trains models in a decentralized manner across AWS, Azure, and GCP while preserving data privacy and regional compliance. The data engineering backbone, built on technologies such as Apache Kafka, Spark, and Airflow, provides high-throughput, fault-tolerant data processing. In the experiments, Gradient Boosting Machines achieved the highest AUC scores (up to 0.95), and regional differences in model performance were statistically significant, as confirmed by ANOVA and post-hoc testing. For time-series forecasting, Prophet outperformed ARIMA across all metrics. Throughput scalability tests showed linear performance gains as compute clusters were added, reaching up to 2 million events per second. The proposed architecture increases the granularity and speed of consumer insight generation while maintaining operational resilience and flexibility across decentralized cloud infrastructures. This research contributes a practical, modular solution for organizations seeking to unify and scale their analytics efforts in dynamic, multi-cloud environments.
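To make the federated learning step concrete, the following is a minimal sketch of how model parameters trained separately in each cloud could be combined into one global model without moving raw consumer records. The abstract does not state which aggregation rule the study uses; sample-weighted averaging (FedAvg-style) is assumed here purely for illustration, and the function name, the data layout, and the numeric values are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical per-cloud update: (model weights, number of local training samples).
ClientUpdate = Tuple[List[float], int]


def federated_average(updates: Dict[str, ClientUpdate]) -> List[float]:
    """Combine per-cloud model weights into one global model (FedAvg-style).

    Each cloud's weights are scaled by its share of the total sample count,
    so only parameters and counts are exchanged, never raw records.
    """
    if not updates:
        raise ValueError("at least one client update is required")

    total_samples = sum(n for _, n in updates.values())
    if total_samples <= 0:
        raise ValueError("total sample count must be positive")

    dims = {len(w) for w, _ in updates.values()}
    if len(dims) != 1:
        raise ValueError("all clients must report weight vectors of equal length")

    global_weights = [0.0] * dims.pop()
    for weights, n_samples in updates.values():
        share = n_samples / total_samples
        for i, w in enumerate(weights):
            global_weights[i] += share * w
    return global_weights


# Illustrative values only; they are not taken from the study.
example_updates = {
    "aws": ([0.2, 0.4], 100),
    "azure": ([0.4, 0.0], 100),
    "gcp": ([0.6, 0.8], 200),
}
global_model = federated_average(example_updates)
```

In this sketch the cloud with the most local samples contributes most to the global model, which is one common design choice; a production system would typically also add secure aggregation and repeat the exchange over many training rounds.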