Multimodal Natural Language Processing: Integrating Text, Vision, and Speech for Enhanced Artificial Intelligence Understanding
Abstract
Multimodal sentiment analysis is a developing research area that combines diverse inputs such as text, speech, and vision to improve the accuracy of emotion identification. In this research, the MELD dataset is used for sentiment classification by integrating a Random Forest (text), an SVM (speech), and an ANN/CNN (vision). The results show that the vision models performed best, with the CNN outperforming the ANN and attaining 75% accuracy; the Random Forest reached about 56%, while the SVM achieved only about 17% test accuracy on speech and most often misclassified sentiments as neutral. These findings underline the need for multimodal fusion, in which the speech, text, and vision modalities complement one another to reduce classification error. Future improvements include transformer-based text and speech models (BERT, Wav2Vec), attention-based CNNs for facial analysis, and advanced fusion methods such as early, late, and hybrid fusion. Multimodal sentiment analysis has real-world applications in human-computer interaction, AI-driven monitoring of customer sentiment, mental health tracking, and content moderation on social media.
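To make the per-modality setup and the fusion idea concrete, the sketch below shows a minimal late-fusion pipeline in the spirit of the abstract: a Random Forest for text, an SVM for speech, and a small neural network for vision, with their class probabilities averaged. This is an illustrative assumption, not the authors' exact pipeline; the feature matrices are synthetic placeholders standing in for MELD text, audio, and vision features, and all variable names are hypothetical.

```python
# Minimal late-fusion sketch (illustrative only, not the paper's exact code).
# Synthetic arrays stand in for MELD text/audio/vision features and labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_train, n_test, n_classes = 200, 50, 3  # negative / neutral / positive

# Placeholder features per modality (replace with real MELD-derived features).
X_text_tr, X_text_te = rng.normal(size=(n_train, 300)), rng.normal(size=(n_test, 300))
X_aud_tr, X_aud_te = rng.normal(size=(n_train, 40)), rng.normal(size=(n_test, 40))
X_vis_tr, X_vis_te = rng.normal(size=(n_train, 128)), rng.normal(size=(n_test, 128))
y_tr = rng.integers(0, n_classes, size=n_train)

# One classifier per modality, mirroring the abstract: RF (text), SVM (speech), ANN (vision).
text_clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_text_tr, y_tr)
audio_clf = SVC(probability=True, random_state=0).fit(X_aud_tr, y_tr)
vision_clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X_vis_tr, y_tr)

# Late fusion: average the per-modality class probabilities, then take the argmax.
probs = (text_clf.predict_proba(X_text_te)
         + audio_clf.predict_proba(X_aud_te)
         + vision_clf.predict_proba(X_vis_te)) / 3.0
fused_pred = probs.argmax(axis=1)
print(fused_pred[:10])
```

Averaging probabilities is the simplest late-fusion rule; early fusion would instead concatenate the modality features before training a single classifier, and hybrid schemes combine both.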