Comparative Study of Unimodal and Multimodal Systems Based on MNN

Hajar Filali, Chafik Boulealam, Hamid Tairi, Khalid El fazazy, Adnane Mohamed Mahraz, Jamal Riffi

Abstract

Emotion recognition has emerged as a pivotal area in the development of emotionally intelligent systems, with research traditionally focusing on unimodal approaches. Recent advances, however, have highlighted the benefits of multimodal systems that leverage complementary inputs such as text, speech, and visual cues. This study conducts a comparative analysis of unimodal and multimodal emotion recognition systems based on the Meaningful Neural Network (MNN) architecture. Our approach integrates advanced feature extraction techniques: a Graph Convolutional Network for acoustic data, a Capsule Network for textual data, and a Vision Transformer for visual data. By fusing these modalities, the MNN model learns more meaningful representations and achieves higher accuracy than its unimodal counterparts. The proposed model is evaluated on two public datasets, MELD [1], [2] and MOSEI [3]. On the MELD dataset, the unimodal system achieved an accuracy of 79.5%, while the multimodal system reached 86.69%. On the MOSEI dataset, the unimodal system attained an accuracy of 47%, whereas the multimodal system achieved 56%. These results demonstrate the effectiveness of multimodal systems over unimodal approaches, particularly when employing sophisticated neural network architectures such as the MNN.
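
The abstract describes per-modality feature extraction (GCN for acoustic, Capsule Network for text, Vision Transformer for visual) followed by fusion into a single classifier. The sketch below illustrates one plausible late-fusion arrangement of such features; the class names, feature dimensions, and concatenation-plus-MLP fusion head are illustrative assumptions in PyTorch, not the authors' MNN implementation.

```python
# Minimal sketch of multimodal feature fusion (assumed PyTorch).
# All dimensions and the fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn


class MultimodalFusionClassifier(nn.Module):
    def __init__(self, acoustic_dim=128, text_dim=256, visual_dim=768,
                 hidden_dim=256, num_classes=7):
        super().__init__()
        # Placeholder projections standing in for the GCN (acoustic),
        # Capsule Network (text), and Vision Transformer (visual) encoders.
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Fusion head: concatenate projected modality features, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, acoustic_feat, text_feat, visual_feat):
        fused = torch.cat([
            torch.relu(self.acoustic_proj(acoustic_feat)),
            torch.relu(self.text_proj(text_feat)),
            torch.relu(self.visual_proj(visual_feat)),
        ], dim=-1)
        return self.classifier(fused)


# Example forward pass with random features for a batch of 4 utterances.
model = MultimodalFusionClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 256), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 7])
```

The unimodal baselines described in the abstract would correspond to training on a single encoder's features; the multimodal variant combines all three before classification.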
