Evaluating the Effectiveness of Implementing the SauDiSenti Lexicon in Saudi Dialect Sentiment Analysis
Abstract
Introduction: Sentiment analysis is one of the most important methods for identifying an individual’s general emotional state or opinion toward a particular topic.
Objectives: This study evaluates the effectiveness of the SauDiSenti lexicon in conducting sentiment analysis of Saudi dialect tweets. SauDiSenti is the only publicly available lexicon for the Saudi dialect, and it was developed to address the shortage of linguistic resources for Arabic sentiment analysis.
Methods: A total of 27,000 tweets from discussions around the STC Pay platform were collected and preprocessed, then analyzed through manual annotation, lexicon-based, and machine-learning approaches. To evaluate the performance of these methods, metrics such as accuracy, precision, recall, and F1 score were computed.
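The lexicon-based labeling and the evaluation metrics named above can be illustrated with a minimal sketch. It assumes a simple sum-of-polarities scoring rule, which may differ from the scoring the study actually used. The lexicon entries, the example texts, and the function names (`lexicon_label`, `per_class_metrics`, `accuracy`) are hypothetical placeholders, not SauDiSenti entries or the authors' code.

```python
# Hypothetical toy lexicon: term -> polarity (+1 positive, -1 negative).
# These English entries are placeholders, not actual SauDiSenti entries.
TOY_LEXICON = {"great": 1, "fast": 1, "love": 1,
               "slow": -1, "broken": -1, "bad": -1}


def lexicon_label(text, lexicon=TOY_LEXICON):
    """Sum the polarity of known tokens and map the sign to a label."""
    score = sum(lexicon.get(tok, 0) for tok in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def accuracy(gold, pred):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_class_metrics(gold, pred, label):
    """Return (precision, recall, F1) for one class, treated one-vs-rest."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical examples with manually assigned (gold) labels.
tweets = [
    "great app fast transfer",
    "transfer is slow and broken",
    "opened the app today",
    "bad update but love the design",
]
gold = ["positive", "negative", "neutral", "negative"]

pred = [lexicon_label(t) for t in tweets]
print("accuracy:", accuracy(gold, pred))
for label in ("positive", "negative"):
    print(label, per_class_metrics(gold, pred, label))
```

In this toy run the last example contains one positive and one negative term, so the summed score is zero and it is labeled neutral. Mixed-polarity texts of this kind are one way a compact lexicon can disagree with manual annotation.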
Results: The lexicon-based approach achieved an overall accuracy of 92% and F1 scores of 89.7% and 83.8% for positive and negative sentiments, respectively. The annotation-based approach provided more accurate sentiment classification, labeling approximately 50.27% of tweets as neutral, 31.13% as positive, and 18.6% as negative. Furthermore, Support Vector Machine (SVM), Naïve Bayes (NB), and Logistic Regression (LR) classifiers were trained on the labeled datasets produced by each method. SVM outperformed the other two classifiers, yielding 93% accuracy with the lexicon-based corpus and 96% with the annotation-based corpus.
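As an illustration of training a classifier on a labeled corpus, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. It is not the authors' implementation, and it does not cover the SVM or LR models. The class name `TinyNaiveBayes` and the training and held-out texts are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        # Count documents per class and token occurrences per class.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_docs / total_docs)
            for tok in text.lower().split():
                if tok in self.vocab:  # tokens unseen in training are skipped
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled corpus (labels could come from a lexicon or annotators).
train_texts = ["great fast transfer", "love this app",
               "slow broken transfer", "bad service today"]
train_labels = ["positive", "positive", "negative", "negative"]

model = TinyNaiveBayes().fit(train_texts, train_labels)

held_out_texts = ["great app", "slow service"]
held_out_labels = ["positive", "negative"]
predictions = [model.predict(t) for t in held_out_texts]
held_out_accuracy = sum(
    p == g for p, g in zip(predictions, held_out_labels)
) / len(held_out_labels)
print(predictions, held_out_accuracy)
```

Training the same classifier once on lexicon-derived labels and once on manually annotated labels, then scoring both on a common held-out set, is one way to compare the two labeling approaches.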
Conclusions: These results show that even though the SauDiSenti lexicon is compact, it still serves as a reliable tool for analysis across various topics. There is a trade-off: manual annotation guarantees specialized precision, whereas the lexicon offers a practical alternative for domain-specific projects. The study recommends expanding the lexicon’s size and diversity to improve its applicability to broader datasets.