Ensemble Deep Learning for Author Attribution in Assamese Text Documents: A Hybrid Approach


Smriti Priya Medhi, Shikhar Kumar Sarma

Abstract

Introduction: Author attribution is a critical task in computational linguistics, particularly for low-resourced languages. Such languages often lack sufficient annotated datasets and robust linguistic tools, making natural language processing (NLP) applications challenging. A piece of text generally reflects the unique stylistic traits of its author, including vocabulary choice, punctuation, and sentence structure. Identifying the correct author from these textual cues is the central aim of author attribution. Despite its importance, the area remains relatively unexplored for low-resourced languages because of inherent data and resource limitations.


Objectives: This study addresses author attribution in low-resourced languages by leveraging deep learning techniques. Specifically, the objectives are to investigate how neural network models such as Recurrent Neural Networks (RNNs) and a hybrid Convolutional Neural Network with Long Short-Term Memory (CNN-LSTM) can effectively capture the stylistic features of a text, and to design an ensemble model that combines these approaches for improved performance in multi-author scenarios.
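
As a rough illustration of the kind of hybrid architecture referred to here, the sketch below builds a CNN-LSTM author classifier in Keras. The vocabulary size, sequence length, embedding dimension, layer widths, and number of candidate authors are illustrative assumptions, not the hyperparameters used in the study.

```python
# Minimal sketch of a hybrid CNN-LSTM author classifier (Keras).
# All sizes below are illustrative assumptions, not the paper's settings.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000    # assumed vocabulary size
MAX_LEN = 400         # assumed maximum token sequence length
NUM_AUTHORS = 10      # assumed number of candidate authors

def build_cnn_lstm(vocab_size=VOCAB_SIZE, max_len=MAX_LEN, num_authors=NUM_AUTHORS):
    """Convolutions capture local n-gram style cues; the LSTM models longer-range structure."""
    inputs = layers.Input(shape=(max_len,))
    x = layers.Embedding(vocab_size, 128)(inputs)
    x = layers.Conv1D(64, kernel_size=5, activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.LSTM(64)(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_authors, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```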


Methods: To achieve these objectives, we employed RNN and CNN-LSTM architectures to model the stylistic nuances present in textual data. The models were trained and tested on the AAALC dataset, which contains Assamese writings from multiple authors. The performance of each model was evaluated using standard classification metrics, namely accuracy and F1-score. In addition, an ensemble model combining the outputs of the RNN and CNN-LSTM was proposed to further boost classification performance by leveraging the strengths of both architectures.
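
The exact combination rule of the proposed ensemble is not detailed in this abstract; one common option is soft voting, i.e., averaging the per-author probability distributions produced by the trained RNN and CNN-LSTM models. The sketch below assumes that scheme and evaluates the combined predictions with accuracy and macro-averaged F1; names such as rnn_model, cnn_lstm_model, x_test, and y_test are placeholders.

```python
# Minimal sketch of a soft-voting ensemble over two trained Keras models,
# evaluated with accuracy and macro F1. Probability averaging is shown
# as an assumption; the paper's actual combination rule may differ.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def ensemble_predict(trained_models, x_test):
    """Average the per-author probability distributions of all models."""
    probs = np.mean([m.predict(x_test) for m in trained_models], axis=0)
    return np.argmax(probs, axis=1)

# Usage (x_test: padded token sequences, y_test: integer author labels):
# y_pred = ensemble_predict([rnn_model, cnn_lstm_model], x_test)
# print("Accuracy:", accuracy_score(y_test, y_pred))
# print("Macro F1:", f1_score(y_test, y_pred, average="macro"))
```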


Results: Our experiments demonstrate that deep learning models are highly effective for author attribution in low-resourced languages. Among the evaluated models, the ensemble approach achieved the best results, with an accuracy of 84.38% and an F1-score of 79.0%. These results indicate that combining neural network architectures can significantly improve classification performance by capturing a richer representation of stylistic features.


Conclusions: This study confirms the potential of neural network-based models for the author attribution task in low-resourced languages. The proposed ensemble model achieved strong performance and demonstrated the practical viability of deep learning in linguistic contexts where traditional resources are scarce. These findings contribute to the development of more inclusive NLP systems and offer useful insights for further research in multilingual and resource-constrained settings.
