Comprehensive Analysis of Arabic Tokenization System Preprocessing using the Matching Model


Ibrahim Abdelfattah Almajali, Mutlaq Moraya Nafah Alharbi

Abstract

This research paper proposes a novel knowledge-based Arabic word tokenization system. Word tokenization is the first stage of higher-order Natural Language Processing (NLP) tasks such as Part-of-Speech (PoS) tagging, parsing, and named entity recognition, and further applications can be built once words have been tokenized. The amount of text on the World Wide Web grows daily, which calls for more advanced processing tools. Because Arabic is spoken by a growing number of people worldwide, Arabic language processing systems must be improved. Arabic word tokenization is difficult: the script has no capitalization, compound words are common, and, because of the cursive form of the script, spaces between words are used inconsistently. To develop the system, a maximum matching model is used in its two variations, forward and reverse maximum matching. The proposed system is implemented in Python. The results of the system evaluation indicate high performance.
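The two maximum matching variations named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy lexicons, and the single-character fallback for unknown text are assumptions made for illustration only. Forward matching scans from the start of the text and repeatedly takes the longest lexicon entry; reverse matching does the same from the end.

```python
def forward_maximum_matching(text, lexicon, max_len=None):
    """Greedy longest-match segmentation, scanning from the start of the text."""
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption: an unknown single character becomes its own token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def reverse_maximum_matching(text, lexicon, max_len=None):
    """Greedy longest-match segmentation, scanning from the end of the text."""
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                j -= size
                break
    tokens.reverse()  # tokens were collected from last to first
    return tokens


# Hypothetical toy lexicon; a real system would load a large Arabic word list.
arabic_words = ["ذهب", "الولد", "الى", "المدرسة"]
unspaced = "".join(arabic_words)  # the same words written without spaces
fmm_arabic = forward_maximum_matching(unspaced, set(arabic_words))
rmm_arabic = reverse_maximum_matching(unspaced, set(arabic_words))

# A Latin-script stand-in where the two directions disagree.
toy_lexicon = {"the", "theta", "table", "bled", "own", "down", "there"}
fmm_toy = forward_maximum_matching("thetabledownthere", toy_lexicon)
rmm_toy = reverse_maximum_matching("thetabledownthere", toy_lexicon)
print(fmm_toy)  # ['theta', 'bled', 'own', 'there']
print(rmm_toy)  # ['the', 'table', 'down', 'there']
```

On the unambiguous Arabic string both directions recover the four lexicon words. On the ambiguous Latin stand-in they diverge, which is why the two variations are usually considered together.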
