Enhancing VQA with SELM: A Multi-Model Approach Using SBERT


Kamala Mekala, Siva Rama Krishna Sarma Veerubhotla

Abstract

In Visual Question Answering (VQA), a model is given an image and a natural-language question about it. To generate appropriate answers, the model must understand both textual and visual input. However, two key challenges persist in VQA. The first is the inconsistency between the answers and explanations produced by current approaches. The second is bridging the semantic gap between images and questions, which leads to less accurate explanations. Our goal is to reduce the mismatch between an image's visual components and the generated text, while also compensating for data imbalance. We propose a novel approach named the System of Ensemble Learning Model (SELM). The proposed approach uses stacked models to extract text and image features. The outputs of the stacked models are fed as input to a multi-model fusion transformer, Similarity BERT (SBERT), which compares the predicted output with the ground-truth answers. The proposed SBERT achieves 95% accuracy, outperforming state-of-the-art methods. In the future, this model may be extended to other domains such as healthcare, geospatial, and satellite imagery.
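As a rough illustration of the answer-comparison step described above, the Python sketch below embeds a predicted answer and a ground-truth answer with the open-source sentence-transformers library and scores their cosine similarity. The checkpoint name and similarity threshold are illustrative placeholders; the abstract does not specify SELM's stacking or fusion details, so this is only a minimal sketch of the SBERT comparison idea, not the paper's implementation.

# Minimal sketch of SBERT-style answer comparison (not the paper's exact pipeline).
# Assumes the sentence-transformers library; the checkpoint name and the 0.8
# threshold are illustrative placeholders, not values from the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic SBERT checkpoint

def answers_match(predicted: str, ground_truth: str, threshold: float = 0.8) -> bool:
    """Embed both answers and treat them as matching when their
    cosine similarity exceeds the threshold."""
    emb = model.encode([predicted, ground_truth], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    return score >= threshold

# Semantically equivalent answers score high even when worded differently.
print(answers_match("a man riding a horse", "a person on horseback"))

A similarity-based check of this kind is one way to credit predictions that are phrased differently from the ground truth, rather than requiring an exact string match.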
