A Frequency-Based Approach to Stop Word Detection for Enhanced Clustering
Abstract
Language is among the richest means of human communication. India has a plethora of languages, spoken across its urban areas and remote rural regions, each with its own dialects and tone. Tamil, a Dravidian language, is one of the oldest living languages, with a literature spanning over two millennia, and it was the first Indian language to be printed and published.
Clustering unstructured text data is a significant challenge in natural language processing, especially for low-resource languages like Tamil [1]. Agglomerative clustering is a hierarchical algorithm that follows a bottom-up approach, progressively merging individual data points into clusters based on similarity. Unlike partition-based methods, it does not require a predefined number of clusters, which makes it well suited to exploratory data analysis [10]. This paper proposes a methodology that combines dynamic stop word identification, language-specific preprocessing, and sentence embedding using BERT. The embeddings are L2-normalized and then reduced in dimensionality with UMAP. This approach improves clustering performance, as indicated by favourable metric values. Dynamic stop words are identified with a frequency-based approach, and the final stop word list is obtained by combining the frequency-derived words with a common static stop word list.
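The frequency-based step can be illustrated with a minimal sketch. The abstract does not state the frequency measure or the cut-off, so this sketch makes several assumptions: it uses document frequency (the share of documents containing a token), a hypothetical threshold parameter `min_doc_ratio`, plain whitespace tokenization, and toy English sentences in place of Tamil text. The function names `dynamic_stopwords` and `remove_stopwords` are illustrative and are not taken from the paper.

```python
from collections import Counter


def dynamic_stopwords(docs, static_stopwords, min_doc_ratio=0.75):
    """Return frequency-derived stop words merged with a static list.

    Assumption: a token counts as a dynamic stop word when it occurs in at
    least `min_doc_ratio` of the documents. The paper may use a different
    frequency measure or threshold.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each token once per document (document frequency).
        doc_freq.update(set(doc.split()))
    frequent = {tok for tok, df in doc_freq.items() if df / n_docs >= min_doc_ratio}
    # Final list: frequency-derived words combined with the static words.
    return frequent | set(static_stopwords)


def remove_stopwords(doc, stopwords):
    """Drop stop words from a whitespace-tokenized document."""
    return " ".join(tok for tok in doc.split() if tok not in stopwords)


# Toy corpus (illustrative only; real Tamil text would need
# language-specific tokenization and normalization first).
docs = [
    "the river floods the village",
    "the farmer sells rice in the market",
    "rice prices rise in the city",
    "the team won the match",
]
static_list = {"in", "of"}  # stand-in for a curated static stop word list

stopwords = dynamic_stopwords(docs, static_list, min_doc_ratio=0.75)
cleaned = [remove_stopwords(d, stopwords) for d in docs]
```

In this sketch, only tokens that clear the document-frequency threshold are added to the static list, so corpus-specific filler words are removed before embedding while rarer content words are kept. The cleaned sentences would then be passed to the embedding, normalization, and clustering stages described in the abstract.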