Attention-Enhanced Deep Feature Fusion: A Comprehensive Framework for Multimodal Sentiment Analysis
Abstract
Multimodal Sentiment Analysis (MSA) is an evolving field focused on capturing multifaceted human emotions by integrating diverse data types, including text, audio, and visuals. Leveraging multiple modalities enhances sentiment prediction by enriching the representational capacity of model architectures. To address the challenges in this domain, we propose a novel approach that combines advanced feature extraction techniques (BERT for textual data, Wav2Vec2 for acoustic data, and a Vision Transformer for visual data) with BiLSTM networks augmented by self-attention and multi-head attention mechanisms. The proposed architecture effectively extracts and fuses modality-specific features to construct a robust multimodal representation. Evaluated on the benchmark CMU-MOSI dataset, the proposed model achieves 82.45% accuracy with the self-attention mechanism and 86.04% with the multi-head attention mechanism, surpassing several transformer-based state-of-the-art approaches. This superiority stems from the carefully designed feature extraction pipeline, the effective fusion strategy, and the ability of BiLSTM with multi-head attention to capture diverse relationships without overfitting. The findings highlight that hybrid architectures integrating the strengths of transformers with attention-augmented recurrent models can outperform pure transformer-based designs for multimodal sentiment analysis.
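To illustrate the fusion stage described above, the following is a minimal PyTorch sketch, not the authors' exact implementation: it assumes modality features have already been extracted offline (e.g., 768-dimensional BERT, Wav2Vec2, and ViT embeddings), and all layer sizes, head counts, and the mean-pooling step are illustrative assumptions.

```python
# Illustrative sketch only: BiLSTM encoders per modality, multi-head attention
# over the fused multimodal sequence, and a sentiment classifier on top.
# Feature dimensions and hyperparameters are assumptions, not the paper's values.
import torch
import torch.nn as nn


class AttentionFusionMSA(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=128, num_heads=4, num_classes=2):
        super().__init__()
        # One BiLSTM per modality, applied to that modality's feature sequence.
        self.text_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.audio_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.visual_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Multi-head self-attention over the concatenated multimodal sequence.
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden_dim, num_heads=num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feats, audio_feats, visual_feats):
        # Each input: (batch, seq_len, feat_dim) of pre-extracted features.
        t, _ = self.text_lstm(text_feats)
        a, _ = self.audio_lstm(audio_feats)
        v, _ = self.visual_lstm(visual_feats)
        # Fuse by concatenating along the time axis, then attend across modalities.
        fused = torch.cat([t, a, v], dim=1)
        attended, _ = self.attn(fused, fused, fused)
        # Mean-pool the attended sequence and predict sentiment.
        return self.classifier(attended.mean(dim=1))


if __name__ == "__main__":
    model = AttentionFusionMSA()
    text = torch.randn(2, 20, 768)    # e.g., BERT token embeddings
    audio = torch.randn(2, 50, 768)   # e.g., Wav2Vec2 frame embeddings
    visual = torch.randn(2, 10, 768)  # e.g., ViT patch/frame embeddings
    print(model(text, audio, visual).shape)  # torch.Size([2, 2])
```

Replacing the multi-head attention layer with a single-head (self-attention) variant corresponds to the 82.45% configuration reported in the abstract, while the multi-head variant corresponds to the 86.04% configuration; the sketch above only fixes the overall data flow, not the reported hyperparameters.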