Multi-Head Attention Transformer for Arabic Scene Image Text Recognition

Oualid KHIAL, Fatma BOUFERRA

Abstract

The worldwide video library continues to expand rapidly, creating a growing need for modern, reliable techniques for video processing and text indexing. In this paper, we introduce a new implementation of the Transformer architecture for scene text recognition. The work builds on a comparative study of two approaches: feeding convolutional feature maps to the Transformer encoder, and removing the CNN component entirely. For training, we used nearly all publicly available datasets, yet they proved insufficient given the significant lack of large-scale, diverse datasets for this task. This challenge led us to create and publish a new synthetic dataset, IYaD. IYaD currently contains around 1,400,000 images for one font, and the same scale for 16 additional fonts. Each image is provided in three different versions and comes with Arabic labels, a Latin transcription, and the text content. The experimental results show that our Transformer-based Arabic scene text recognition (ASTR) model surpasses state-of-the-art methods, especially when trained on IYaD, setting new benchmarks in accuracy and robustness. We believe this dataset demonstrates the value and potential of artificially generated datasets, and it may encourage similar dataset generation in other research domains.
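For illustration only, here is a minimal sketch (assuming PyTorch; the module names, layer sizes, and small backbone are hypothetical, not the authors' implementation) of the two encoder input pipelines the study compares: convolutional feature maps flattened into tokens for a Transformer encoder, versus a CNN-free variant that linearly embeds raw image patches.

```python
import torch
import torch.nn as nn

class CNNTransformerEncoder(nn.Module):
    """Variant 1: CNN feature maps are flattened into tokens for the encoder."""
    def __init__(self, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        # Hypothetical small backbone; the paper's actual CNN is not specified here.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, images):                     # images: (B, 3, H, W)
        feats = self.backbone(images)              # (B, d_model, H/4, W/4)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, seq_len, d_model)
        return self.encoder(tokens)                # positional encoding omitted for brevity

class PatchTransformerEncoder(nn.Module):
    """Variant 2: no CNN; raw patches are linearly projected (ViT-style)."""
    def __init__(self, patch=8, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size=patch, stride=patch)  # cut image into patches
        self.proj = nn.Linear(3 * patch * patch, d_model)         # linear patch embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, images):
        patches = self.unfold(images).transpose(1, 2)  # (B, num_patches, 3*patch*patch)
        return self.encoder(self.proj(patches))

imgs = torch.randn(2, 3, 32, 128)           # e.g. 32x128 cropped word images
print(CNNTransformerEncoder()(imgs).shape)   # torch.Size([2, 256, 256])
print(PatchTransformerEncoder()(imgs).shape) # torch.Size([2, 64, 256])
```

Both variants hand the downstream decoder a sequence of d_model-dimensional tokens; the difference is only in how images become tokens, which is the axis the paper's comparison varies.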
