Stress Detection using Multimodal Representation Learning, Fusion Techniques, and Applications

Yogesh J. Gaikwad, Kalyani Kadama, Rutuja Rajendra Patil, Gagandeep Kaur

Abstract

The fields of speech recognition, image recognition, and natural language processing have undergone a paradigm shift with the advent of machine learning and deep learning. While these tasks rely primarily on a single input modality, many applications in artificial intelligence require several modalities, and modelling and learning across modalities has therefore drawn growing attention from the research community in recent years. This article provides a comprehensive analysis of the models and learning methods available for multimodal intelligence, concentrating on the fusion of video and language modalities, which has become a central topic in both computer vision and natural language processing research. We review recent work on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal inputs at different levels, and multimodal applications. On representation learning, the article examines embeddings that map signals from different modalities into a unified vector space, enabling cross-modal signal processing, and surveys several forms of embedding designed and trained for common downstream tasks. On multimodal fusion, it focuses on architectures that merge representations of unimodal inputs for a specific task.
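To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the paper's actual method: each modality is projected by its own encoder into a shared embedding space (representation learning), and the resulting unimodal embeddings are then merged by simple concatenation for a downstream classifier (fusion), here framed as a hypothetical binary stress-detection task. All class names, feature dimensions, and the concatenation-based fusion design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Projects a unimodal feature vector into a shared embedding space."""
    def __init__(self, input_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(input_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalise so embeddings from different modalities are comparable.
        return nn.functional.normalize(self.proj(x), dim=-1)

class LateFusionClassifier(nn.Module):
    """Fuses video and text embeddings by concatenation for a downstream
    task (e.g. binary stress detection). Illustrative sketch only."""
    def __init__(self, video_dim: int, text_dim: int,
                 embed_dim: int = 128, num_classes: int = 2):
        super().__init__()
        self.video_encoder = ModalityEncoder(video_dim, embed_dim)
        self.text_encoder = ModalityEncoder(text_dim, embed_dim)
        self.fusion_head = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, num_classes),
        )

    def forward(self, video_feats: torch.Tensor,
                text_feats: torch.Tensor) -> torch.Tensor:
        v = self.video_encoder(video_feats)   # (batch, embed_dim)
        t = self.text_encoder(text_feats)     # (batch, embed_dim)
        fused = torch.cat([v, t], dim=-1)     # simple concatenation fusion
        return self.fusion_head(fused)        # class logits

# Random stand-in features; dimensions are assumptions, not from the paper.
model = LateFusionClassifier(video_dim=512, text_dim=768)
video_feats = torch.randn(4, 512)   # e.g. pooled frame features
text_feats = torch.randn(4, 768)    # e.g. pooled transcript features
logits = model(video_feats, text_feats)
print(logits.shape)                 # torch.Size([4, 2])
```

Concatenation followed by a small MLP is only one of the fusion designs the article surveys; attention-based or tensor-based fusion would replace the `torch.cat` step while keeping the per-modality encoders unchanged.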
