Audio Deepfake Detection with Stacking Ensemble: Performance vs. Generalization

Zeltni Kamel, Habbati Billel

Abstract

The rapid progress in generative speech technologies has yielded highly realistic audio deepfakes, posing substantial challenges to security, trust, and media credibility. This research proposes a stacking ensemble method for audio deepfake detection that uses XGBoost and Random Forest as base learners and a Multilayer Perceptron (MLP) as the meta-learner. Assessed via ten-fold cross-validation, the model performs well, with accuracy exceeding 94% and a Matthews Correlation Coefficient (MCC) above 0.88. However, when tested on an unseen dataset, the model suffers a drastic performance drop, indicating limited generalization. To investigate the cause, we conducted two targeted experiments: (1) extensive data augmentation to regularize training and (2) intentional underfitting by reducing model capacity. Neither strategy improved test performance, which rules out overfitting as the primary problem; instead, our findings point to a deeper issue of distributional shift between the training and deployment domains. Nevertheless, the results indicate that established machine learning methods remain practically feasible in resource-limited or real-time scenarios, given their efficiency and comparable performance. This work underscores the need for domain-robust feature representations, cross-dataset validation, and scalable solutions for real-world audio deepfake detection.
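The architecture described above can be sketched with scikit-learn's `StackingClassifier`. This is a minimal illustration, not the authors' implementation: synthetic features stand in for the paper's audio features, and `GradientBoostingClassifier` stands in for XGBoost (whose `XGBClassifier` exposes the same scikit-learn interface) to keep the example dependency-free.

```python
# Sketch of a stacking ensemble: two tree-based base learners feed an MLP
# meta-learner; performance is estimated with ten-fold cross-validated MCC.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic binary data standing in for real/fake audio feature vectors.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    # The meta-learner is trained on the base learners' out-of-fold predictions.
    final_estimator=MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                  random_state=0),
    cv=5,  # internal folds used to build the meta-features
)

# Ten-fold cross-validation with MCC, as in the abstract.
scores = cross_val_score(stack, X, y, cv=10,
                         scoring=make_scorer(matthews_corrcoef))
print(f"mean MCC over 10 folds: {scores.mean():.3f}")
```

Note that `StackingClassifier` generates the meta-learner's training inputs from out-of-fold base-learner predictions, which avoids leaking the base learners' training fit into the meta-level; cross-dataset evaluation, as the abstract argues, would still be needed to assess generalization under distributional shift.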
