Image Caption Generation Using Deep Learning

D Prannav, Adnan Anwar, Sunayana S, Shravya A R, Chandrashekar Patil

Abstract

Image caption generation, at the intersection of computer vision and natural language processing, produces descriptive text captions for images using deep learning models. This paper proposes a hybrid CNN-LSTM system for automatic captioning in which a pre-trained convolutional neural network (CNN) extracts image features and a long short-term memory (LSTM) network generates the caption as a sequence of words conditioned on those features. Using the Flickr8k dataset, the paper addresses key challenges such as vocabulary sparsity, overfitting, and computational complexity. Experimental results achieve BLEU scores of 0.66 or higher, indicating that the model produces captions that are both accurate and linguistically coherent. Qualitative analysis, however, reveals shortcomings on complex scenes containing multiple objects or activities, pointing to areas for future improvement. The paper also discusses potential enhancements, such as transformer-based architectures and attention mechanisms, to improve caption accuracy and accessibility. Overall, this work contributes to advancing large-scale human-computer interaction through multimodal AI systems that interpret visual input and generate human-like text.
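To make the CNN-LSTM pipeline described above concrete, the following is a minimal sketch of a merge-style captioning model. It is not the authors' implementation: the choice of TensorFlow/Keras, InceptionV3 as the pre-trained CNN, the 256-unit layers, and the vocab_size and max_len values are all illustrative assumptions.

# A minimal sketch of the kind of CNN-LSTM captioning model described in the abstract.
# Assumptions (not taken from the paper): TensorFlow/Keras, InceptionV3 as the
# pre-trained CNN, 256-dimensional embeddings/LSTM, and illustrative values for
# the vocabulary size and maximum caption length.

from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8000   # hypothetical Flickr8k vocabulary size after filtering rare words
max_len = 34        # hypothetical maximum caption length in tokens

# Pre-trained CNN used as a frozen feature extractor: each image is run through it
# once and the 2048-d pooled vector is cached before training the caption model.
feature_extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
feature_extractor.trainable = False

# Image branch: project the cached CNN features into the caption space.
img_input = Input(shape=(2048,), name="image_features")
img_proj = Dropout(0.5)(img_input)
img_proj = Dense(256, activation="relu")(img_proj)

# Text branch: encode the partial caption generated so far with an LSTM.
cap_input = Input(shape=(max_len,), name="partial_caption")
cap_embed = Embedding(vocab_size, 256, mask_zero=True)(cap_input)
cap_embed = Dropout(0.5)(cap_embed)
cap_encoded = LSTM(256)(cap_embed)

# Merge both branches and predict the next word of the caption.
merged = add([img_proj, cap_encoded])
merged = Dense(256, activation="relu")(merged)
next_word = Dense(vocab_size, activation="softmax")(merged)

caption_model = Model(inputs=[img_input, cap_input], outputs=next_word)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")

At inference time, a caption would be generated word by word: start from a start-of-sequence token, predict the next word from the image features and the partial caption, append it, and repeat until an end token is produced or max_len is reached.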
