Deep Multimodal Fusion for Ingredient Prediction from Food Images and Recipe Descriptions
Abstract
The growing integration of artificial intelligence (AI) into food informatics enables intelligent culinary systems capable of complex tasks such as ingredient prediction, dietary analysis, and automated recipe generation. This research presents a novel multimodal AI framework for predicting food ingredients from heterogeneous input modalities, namely food images and their corresponding English textual descriptions. The proposed two-stage system uses convolutional neural networks (CNNs) to extract visual features from images and transformer-based models to extract features from the textual descriptions, enabling accurate identification of both common and rare ingredients. To support robust training and evaluation, a dataset of curated and annotated examples spanning a range of cuisines is developed. The system employs an attention-based multimodal fusion strategy that dynamically combines the visual and textual embeddings, allowing effective ingredient prediction even when the input is partial or ambiguous. Experimental results show that the proposed approach outperforms unimodal and early-fusion baselines, achieving a Top-1 accuracy of 82.7%, a mean average precision (mAP) of 74.6%, and an F1-score of 80.1%. An ablation study validates the contribution of each system component and the effectiveness of the attention-driven fusion mechanism. The model also generalizes well to regional food variations and to dietary personalization. This work contributes a scalable, adaptable, and accurate ingredient prediction model to the advancement of AI-driven food analytics. Potential applications include smart kitchen systems, personal nutrition tracking, and health monitoring. Future extensions may incorporate sensory knowledge, improve model performance, and broaden support for multilingual and culturally specific culinary data.
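To make the fusion step concrete, the sketch below shows one way attention-driven fusion of CNN visual features and transformer text embeddings can be realized in PyTorch. All dimensions, module names, and the pooling/classification head here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Minimal sketch of attention-driven multimodal fusion.

    Assumes region features from a CNN backbone (e.g., 2048-d) and token
    embeddings from a transformer text encoder (e.g., 768-d); these sizes
    and the fusion head are hypothetical choices, not the paper's.
    """

    def __init__(self, visual_dim=2048, text_dim=768,
                 fused_dim=512, num_ingredients=1000):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.visual_proj = nn.Linear(visual_dim, fused_dim)
        self.text_proj = nn.Linear(text_dim, fused_dim)
        # Cross-modal attention: text tokens attend to image regions.
        self.cross_attn = nn.MultiheadAttention(
            fused_dim, num_heads=8, batch_first=True)
        # Multi-label head: one logit per candidate ingredient.
        self.classifier = nn.Linear(fused_dim, num_ingredients)

    def forward(self, visual_feats, text_feats):
        # visual_feats: (B, R, visual_dim) CNN region features
        # text_feats:   (B, T, text_dim) transformer token embeddings
        v = self.visual_proj(visual_feats)
        t = self.text_proj(text_feats)
        # Text queries attend over visual keys/values, so the fused
        # representation weighs image regions by textual relevance.
        fused, _ = self.cross_attn(query=t, key=v, value=v)
        # Pool over tokens, then score ingredients.
        pooled = fused.mean(dim=1)
        return self.classifier(pooled)  # logits per ingredient


# Toy usage: batch of 2 images (49 regions) and descriptions (32 tokens).
model = AttentionFusion()
logits = model(torch.randn(2, 49, 2048), torch.randn(2, 32, 768))
print(logits.shape)  # torch.Size([2, 1000])
```

Routing text queries over visual keys and values is one natural reading of "dynamically fusing" the two modalities: when the description is partial or ambiguous, attention weights can lean on the informative image regions, and vice versa under a symmetric variant.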