Analyzing Various Machine Learning Algorithms for Opinion Extraction from Web Text Using AI Across Multiple Datasets

Erugu Krishna, Sonawane Vijay Ramnath

Abstract

Opinion extraction from web text is essential for understanding public attitudes in e-commerce, news, and social media, yet it remains challenging because of noisy language, short informal messages, and inconsistent sentiment labels. This study proposes a unified AI-driven pipeline for three-class sentiment classification (positive, neutral, negative) across multiple web-text domains. The workflow performs label normalization, missing-value removal, de-duplication, and text cleaning (URL and mention removal, hashtag normalization, and whitespace standardization). The cleaned text is represented as TF–IDF vectors over unigram and bigram features and evaluated with twelve classical machine learning classifiers, with a focus on LinearSVM for robust discrimination and Calibrated LinearSVM for probability-based analysis. Experiments are conducted on three datasets: product reviews, Times of India headlines, and English political tweets. Performance is assessed using accuracy, precision, recall, F1-score, confusion matrices, and one-vs-rest (OvR) ROC and precision–recall curves. On the Times of India dataset, LinearSVM achieves the best accuracy of 0.894, while Calibrated LinearSVM attains a comparable 0.893, indicating strong and consistent performance for headline sentiment classification. The results indicate that TF–IDF combined with linear margin-based models provides an effective and scalable baseline for multi-domain opinion extraction.
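The preprocessing steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the regular expressions, the `LABEL_MAP` spellings, the function names, and the choice to de-duplicate on the cleaned text are all assumptions made for illustration, since the abstract does not specify them.

```python
import re

# Hypothetical raw-label spellings; the abstract does not list the actual ones.
LABEL_MAP = {
    "pos": "positive", "positive": "positive",
    "neu": "neutral", "neutral": "neutral",
    "neg": "negative", "negative": "negative",
}

# Assumed patterns for the cleaning steps named in the abstract.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove URLs and mentions, keep the hashtag word, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)  # "#BestBuy" -> "BestBuy"
    return SPACE_RE.sub(" ", text).strip()


def normalize_label(label):
    """Map a raw label to positive/neutral/negative, or None if unknown."""
    if label is None:
        return None
    return LABEL_MAP.get(str(label).strip().lower())


def prepare(rows):
    """Apply label normalization, missing-value removal, cleaning, de-duplication.

    `rows` is an iterable of (text, label) pairs. Duplicates are detected on
    the cleaned text (an assumption); the first occurrence is kept.
    """
    seen = set()
    prepared = []
    for text, label in rows:
        norm = normalize_label(label)
        if not text or norm is None:
            continue  # drop rows with missing text or an unusable label
        cleaned = clean_text(text)
        if not cleaned or cleaned in seen:
            continue  # drop rows that are empty after cleaning, or duplicates
        seen.add(cleaned)
        prepared.append((cleaned, norm))
    return prepared
```

In a setup like the one described, the output of `prepare` would then be vectorized and classified, for example with scikit-learn's `TfidfVectorizer(ngram_range=(1, 2))` followed by `LinearSVC`, optionally wrapped in `CalibratedClassifierCV` to obtain probabilities. The abstract does not name the library used, so that pairing is only one plausible realization.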
