Thai Sentence Completeness Classification using Fine-Tuned WangchanBERTa
Abstract
Sentence completeness classification plays a crucial role in various natural language processing (NLP) applications, including grammar checking, text auto-completion, and language assessment. The task is particularly challenging in Thai because of the language's unique characteristics, such as flexible word order, implicit subject omission, and the absence of explicit word boundaries. These linguistic properties make traditional rule-based and statistical approaches prone to errors when applied to Thai. To address these challenges, this research applies modern deep learning techniques, fine-tuning pre-trained transformer models for Thai sentence completeness classification. Specifically, this study uses WangchanBERTa, a Thai-specific adaptation of RoBERTa pre-trained on large Thai corpora. A dataset of 2,000 Thai sentences was constructed, consisting of 1,000 complete and 1,000 incomplete sentences, each manually labeled to ensure high data quality. Experimental results show that WangchanBERTa achieves an average accuracy of 99.65%, significantly outperforming mBERT, a popular multilingual baseline, which achieved only 95.82%. Notably, on a Tesla T4 GPU, WangchanBERTa required just 1 hour and 15 minutes to train across all folds, compared with mBERT's 2 hours and 59 minutes. WangchanBERTa was also compared with XLM-R, a state-of-the-art multilingual model, which achieved a slightly higher accuracy of 99.90% but at the cost of greater computational requirements. The results emphasize the advantage of language-specific pre-training in capturing the linguistic nuances of Thai. This research highlights the importance of tailored transformer models for low-resource languages. By demonstrating that WangchanBERTa achieves near state-of-the-art performance with lower computational cost, this work provides a strong foundation for future Thai NLP research.
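For readers who want to reproduce the general approach, the sketch below shows how a WangchanBERTa checkpoint can be fine-tuned for binary sentence-completeness classification with the Hugging Face Transformers library. This is not the authors' released code: the model ID is the publicly available WangchanBERTa checkpoint, while the CSV file names, column names, and hyperparameters are illustrative assumptions only.

```python
# Minimal sketch: fine-tuning WangchanBERTa for binary (complete / incomplete)
# Thai sentence classification. Hyperparameters and data files are assumptions,
# not the paper's exact experimental setup.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

MODEL_ID = "airesearch/wangchanberta-base-att-spm-uncased"  # public WangchanBERTa checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Hypothetical CSV files with columns "text" (Thai sentence) and "label"
# (1 = complete sentence, 0 = incomplete sentence).
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    # Tokenize Thai sentences; 128 tokens is an assumed maximum length.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="wangchanberta-completeness",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
print(trainer.evaluate())
```

The same script can be pointed at a multilingual baseline such as `bert-base-multilingual-cased` or `xlm-roberta-base` by changing `MODEL_ID`, which is one way to reproduce the kind of comparison reported in the abstract; the paper's k-fold cross-validation protocol would additionally require re-splitting the data per fold.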