Performance Analysis of Transformer Based Models for Automatic Short Answer Grading

Rupal Chaudhari, Manish Patel

Abstract

Automatic Short Answer Grading (ASAG) has gained increasing importance in educational technology, where accurate and scalable assessment solutions are needed. Recent advances in Natural Language Processing (NLP) have introduced powerful Transformer-based models, such as Bidirectional Encoder Representations from Transformers (BERT), Text-to-Text Transfer Transformer (T5), and Generative Pre-trained Transformer 3 (GPT-3), which have demonstrated state-of-the-art performance across various text-based tasks. This paper presents a comparative study of these three models in the context of ASAG, evaluating their effectiveness, accuracy, and efficiency. BERT’s bidirectional encoding, T5’s text-to-text framework, and GPT-3’s autoregressive generation are explored in depth to assess their ability to understand, grade, and generate feedback on short answers. We utilize standard ASAG datasets and multiple evaluation metrics, including accuracy, precision, recall, and F1-score, to measure their performance. The comparative analysis reveals that while all three models exhibit strong capabilities, they vary in handling complex language and ambiguous student responses, with trade-offs in computational cost and scalability. This study highlights the strengths and weaknesses of each model in ASAG and offers insights into their practical applications in educational settings.


Introduction: The automation of grading has become a focal point in modern education systems, driven by the increasing demand for scalable and efficient assessment solutions (Sahu & Bhowmick, 2015). With the proliferation of online learning platforms, digital classrooms, and remote education, the ability to automatically grade short-answer questions has gained significant importance (Gomaa & Fahmy, 2020). Automatic Short Answer Grading (ASAG) seeks to evaluate student responses by comparing them to model answers, often assessing the content’s correctness, relevance, and linguistic features—critical components for evaluating students’ understanding and knowledge retention (Busatta & Brancher, 2018).
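
To make the task concrete, the following minimal sketch frames ASAG as scoring a student answer against a reference (model) answer. The record fields and the token-overlap heuristic are illustrative assumptions, not a description of any specific system:

import re
from dataclasses import dataclass

@dataclass
class AsagItem:
    # One grading instance; the fields are illustrative, not a benchmark's schema.
    question: str
    reference_answer: str
    student_answer: str

def tokens(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_score(reference: str, student: str) -> float:
    # Toy heuristic: fraction of reference-answer tokens found in the student answer.
    ref = tokens(reference)
    return len(ref & tokens(student)) / max(len(ref), 1)

item = AsagItem(
    question="What does photosynthesis produce?",
    reference_answer="Photosynthesis produces glucose and oxygen.",
    student_answer="It makes oxygen and sugar for the plant.",
)
score = overlap_score(item.reference_answer, item.student_answer)
print(f"overlap={score:.2f} -> {'correct' if score >= 0.5 else 'incorrect'}")

Note that this correct paraphrase scores only 0.40 ("sugar" never matches "glucose"): exactly the sensitivity to surface wording that motivates the learned models examined in this paper.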


Traditional ASAG approaches typically employed rule-based systems, statistical models, and early machine learning algorithms that relied heavily on predefined keywords, templates, or handcrafted features (Tulu et al., 2021). While effective for straightforward, fact-based questions, these systems struggled to capture the complexity and variability of natural language, resulting in reduced grading accuracy—especially for creative or ambiguous responses (Sychev et al., 2019). Consequently, such methods often required significant manual intervention, limiting their scalability and applicability in dynamic educational settings (Muftah & Aziz, 2013).
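
A minimal caricature of such a rule-based grader illustrates the brittleness described above; the rubric keywords and threshold here are invented purely for illustration:

# Handcrafted keyword rubric, in the style of early rule-based ASAG systems.
RUBRIC = {
    "q1": {"required": {"mitochondria", "energy"}, "min_hits": 2},
}

def rule_based_grade(question_id: str, student_answer: str) -> bool:
    rule = RUBRIC[question_id]
    answer = student_answer.lower()
    hits = sum(1 for keyword in rule["required"] if keyword in answer)
    return hits >= rule["min_hits"]

print(rule_based_grade("q1", "The mitochondria release energy for the cell."))  # True
print(rule_based_grade("q1", "It is the cell's powerhouse, producing ATP."))    # False

The second answer is arguably correct but fails the keyword match, so every paraphrase and synonym must be anticipated by hand, which is precisely the scalability limitation noted above.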


The advent of deep learning, particularly in the field of Natural Language Processing (NLP), has marked a transformative shift in ASAG (Gaddipati et al., 2020). Neural network-based models have demonstrated a remarkable capacity to learn and generalize from large datasets, enabling a more nuanced understanding of language (Wang et al., 2019). This has led to the development of more robust ASAG systems capable of handling a broader spectrum of student responses, ranging from factual answers to complex explanations (Roy et al., 2016).


A pivotal advancement in NLP is the introduction of the Transformer architecture, which has revolutionized how language models are designed and trained (Vaswani et al., 2017). Transformers excel in processing sequential data through self-attention mechanisms that capture long-range dependencies and contextual relationships within text. This architectural innovation has significantly enhanced performance across a variety of NLP tasks, such as machine translation, sentiment analysis, and question answering (Peters et al., 2018), making Transformer-based models particularly suitable for enhancing ASAG systems (Raffel et al., 2020).
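
The core of this architecture, scaled dot-product attention, is compact enough to state directly. The following numpy sketch implements the single-head formulation from Vaswani et al. (2017), Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with toy dimensions chosen only for illustration:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, per Vaswani et al. (2017).
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)       # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over key positions
    return weights @ V                                   # context-mixed token vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # a toy 4-token sequence, d_model = 8
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)                             # (4, 8)

Because every output vector is a weighted mixture over all positions, dependencies between distant tokens are captured in a single step rather than propagated through a recurrence.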


In this paper, we focus on three prominent Transformer-based models—BERT, T5, and GPT-3—each representing a distinct approach to language understanding and processing. These models have set new benchmarks across numerous NLP tasks, and their potential application in ASAG is substantial.
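
As a rough sketch of how these models are typically accessed, assuming the Hugging Face transformers library for BERT and T5 and OpenAI's hosted API for GPT-3 (whose weights are not publicly available); the checkpoint names, prompt format, and label scheme below are illustrative assumptions, and an untuned checkpoint would still need fine-tuning before producing meaningful grades:

from transformers import (
    BertTokenizer, BertForSequenceClassification,  # BERT: bidirectional encoder
    T5Tokenizer, T5ForConditionalGeneration,       # T5: text-to-text encoder-decoder
)

# BERT frames grading as sequence-pair classification over (reference, student).
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
enc = bert_tok("Plants produce oxygen.", "It makes oxygen.", return_tensors="pt")
logits = bert(**enc).logits  # scores for correct vs. incorrect

# T5 casts grading as text generation: the model emits a label as literal text.
t5_tok = T5Tokenizer.from_pretrained("t5-base")
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
prompt = "grade: reference: Plants produce oxygen. student: It makes oxygen."
ids = t5_tok(prompt, return_tensors="pt").input_ids
label = t5_tok.decode(t5.generate(ids, max_new_tokens=4)[0], skip_special_tokens=True)

# GPT-3 is autoregressive and reachable only via OpenAI's API, so grading is done
# zero- or few-shot by prompting (legacy pre-1.0 openai package shown):
# import openai
# openai.Completion.create(model="text-davinci-003",
#                          prompt=prompt + "\nGrade (correct/incorrect):", max_tokens=1)

The three interfaces mirror the architectural differences the comparison turns on: a classification head on an encoder, a generated label from an encoder-decoder, and a prompted completion from a decoder-only model.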


Objectives: The goal of this study is to conduct a comparative analysis of three Transformer models—BERT, T5, and GPT-3—in the context of ASAG: to identify the strengths and weaknesses of each model, to assess their suitability for different types of ASAG tasks, and to determine their practicality for deployment in large-scale educational environments.


Methods: We evaluate each model on standard ASAG datasets using multiple evaluation metrics, including accuracy, precision, recall, and F1-score, and we additionally analyze computational efficiency and scalability. Through this comprehensive comparison, we aim to offer insights that can guide future research and development in this area, ultimately contributing to the creation of more effective and reliable automated grading systems.
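
For reference, the metrics listed above have their standard definitions; the sketch below computes them with scikit-learn on placeholder binary correct/incorrect labels (the arrays are dummy values, not results from this study):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Dummy gold labels and predictions (1 = correct, 0 = incorrect); stand-ins for a
# real ASAG test split, not the results reported in this paper.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")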


Results: Our comparative analysis of BERT, T5, and GPT-3 in the context of Automatic Short Answer Grading (ASAG) reveals important insights into the strengths and limitations of these Transformer models. We discuss the implications of our findings and the practical considerations for deploying these models in educational settings, and we identify potential avenues for future research.


Conclusions: This study provides a comprehensive comparative analysis of BERT, T5, and GPT-3 for ASAG, highlighting their strengths, limitations, and practical considerations. The insights gained from this research contribute to the ongoing development and refinement of automated grading systems, with the potential to enhance educational assessment and support in diverse learning environments.
