Commonsense-based Visual-Linguistic Reasoning for Document Filtering using Multimodal Large Language Models
Abstract
In many real-world scenarios, users need to sift through large collections of image-based documents to find those containing personal or contextually important information, such as names, email addresses, or phone numbers. Manual filtering is inefficient and error-prone, especially when the visual data is unstructured. To address this challenge, we propose an intelligent, automated filtering pipeline that combines techniques from natural language processing (NLP), computer vision, and commonsense reasoning. Our system uses optical character recognition (OCR) to extract textual content from images, followed by textual entailment models and pattern recognition to assess the relevance of the extracted entities in context. A key innovation of our approach is Commonsense-based Visual-Linguistic Reasoning (CVLR), a framework that incorporates knowledge graphs and multimodal large language models (LLMs) to strengthen the system's ability to infer the context and intent behind visual information. We fine-tune state-of-the-art multimodal LLMs on a custom dataset of more than 2,000 image documents, enabling accurate classification of document types (e.g., invoices, ID cards, certificates) and intelligent filtering based on user-defined relevance criteria. The result is a robust solution that identifies the documents that matter to the user, even when explicit identifiers are partially obscured or only contextually implied.
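To make the staged pipeline described above concrete, the following is a minimal Python sketch of its first two stages (OCR extraction and pattern-based identifier detection) together with a hook for the LLM-based relevance judgment. It assumes the pytesseract and Pillow packages for OCR; the function names, regular expressions, and the classify_with_llm placeholder are illustrative assumptions, not the interface of the system reported in the paper.

import re
from PIL import Image
import pytesseract

# Patterns for the explicit identifiers named in the abstract (email addresses
# and phone numbers); names and implied context are left to the LLM stage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"(?:\+?\d{1,3}[\s.-]?)?(?:\(?\d{2,4}\)?[\s.-]?)?\d{3,4}[\s.-]?\d{4}")

def extract_text(image_path: str) -> str:
    """Stage 1: OCR -- pull raw text out of an image-based document."""
    return pytesseract.image_to_string(Image.open(image_path))

def find_identifiers(text: str) -> dict:
    """Stage 2: pattern recognition -- surface explicit personal identifiers."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }

def is_relevant(image_path: str, classify_with_llm) -> bool:
    """Keep a document if it carries explicit identifiers, or if the caller's
    multimodal model judges it relevant from image and text context alone."""
    text = extract_text(image_path)
    found = find_identifiers(text)
    if found["emails"] or found["phones"]:
        return True
    # classify_with_llm stands in for the fine-tuned multimodal LLM; its exact
    # inputs and outputs are not specified in the abstract.
    return classify_with_llm(image_path, text)

In this sketch the cheap regex pass handles documents whose identifiers appear verbatim, and only the remaining documents are routed to the multimodal model, which handles the partially obscured or contextually implied cases the abstract highlights.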