Approach to Handle Compound Out of Vocabulary Words in Hindi Web Queries


Amit Asthana, Ganesh Chandra, Sanjay K. Dwivedi

Abstract

Introduction: Detecting and handling out-of-vocabulary (OOV) words in information retrieval is a challenging task. The problem can become harder in Cross-Lingual Information Retrieval (CLIR) because of the complications of query translation. Compound Hindi OOV words have received little attention in the literature, and no appropriate solution has been proposed for CLIR with web queries containing such words. If these words are not identified, the intended meaning of the query may not be properly understood.


Objectives: The objective of this paper is to understand the impact of compound Hindi out-of-vocabulary words in web queries on retrieval effectiveness, and to handle queries containing such words.


Methods: This paper proposes an algorithm to detect and handle compound Hindi OOV words in web queries. The algorithm is applied to two categories of web query (one-word queries consisting only of a compound Hindi OOV word, and multi-word queries containing at least one such word), and retrieval effectiveness is measured in terms of precision and average precision.
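
For readers unfamiliar with the two evaluation measures named above, the following is a minimal sketch of how precision and average precision are commonly computed for a single query. It is not the authors' evaluation code: the function names, the rank cutoff `k`, and the choice to normalize average precision by the number of relevant documents retrieved (unless a total is supplied) are assumptions made for illustration, since the abstract does not state which variant was used.

```python
from typing import Optional, Sequence


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant.

    `relevance` holds one judgment per retrieved document, in rank order.
    The cutoff `k` is an assumption; the abstract does not specify one.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(relevance[:k]) / k


def average_precision(
    relevance: Sequence[bool], total_relevant: Optional[int] = None
) -> float:
    """Mean of the precision values taken at each rank holding a relevant document.

    If `total_relevant` is omitted, the number of relevant documents
    actually retrieved is used as the denominator (an assumption).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    denominator = hits if total_relevant is None else total_relevant
    return precision_sum / denominator if denominator else 0.0


# Hypothetical judgments for the top five results of one query.
judgments = [True, False, True, False, False]
p_at_5 = precision_at_k(judgments, 5)
ap = average_precision(judgments)
```

Average precision rewards rankings that place relevant documents early, which is why it is often reported alongside plain precision when comparing a baseline with a modified query.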


Results: The results show an improvement of 8.53% for one-word web queries consisting only of an OOV word, and 15.68% for queries of more than one word containing at least one OOV word.


Conclusions: The results indicate that the proposed approach detects and handles a specific type of OOV word present in Hindi web queries. Retrieval effectiveness improved for both types of query. Queries with more than one word showed greater improvement than queries consisting of a single OOV word, which could be attributed to the contextual information offered by the neighboring words.
