Comparative Analysis of CNN Architectures for English and Gujarati Speech Recognition Using MFCC Features
Abstract
The paper investigates the efficiency of Convolutional Neural Network (CNN) architectures for speech recognition, focusing on the English and Gujarati languages. The study explores the impact of CNN layer depth, comparing 2-, 3-, and 4-layer configurations. Mel-Frequency Cepstral Coefficients (MFCCs) are employed for feature extraction before the data are fed into the CNN models. Two activation functions, the Rectified Linear Unit (ReLU) and the hyperbolic tangent (tanh), are examined across all architectures. The research uses the Speech Commands dataset for English and a Gujarati digits dataset. After preprocessing and MFCC feature extraction, CNNs of varying depth are trained on both languages, with parameters explored to balance performance and efficiency, emphasizing tailored solutions for diverse linguistic contexts. The results reveal that ReLU consistently yields superior performance on both the English and Gujarati datasets. In addition, the study finds that increasing the depth of the CNN layers does not necessarily lead to improved recognition accuracy. The findings underscore the importance of selecting appropriate activation functions, highlight the nuanced relationship between CNN depth and recognition performance, and contribute to the understanding of CNN architecture optimization for speech recognition tasks in diverse linguistic contexts. The insights gained can inform the design of more effective speech recognition systems for globally recognized languages, such as English, and for vernacular languages such as Gujarati.
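The pipeline described in the abstract, MFCC features passed through a stack of 2, 3, or 4 convolutional layers with ReLU or tanh activation, can be sketched minimally as follows. This is an illustrative toy forward pass only: the kernel sizes, random weights, single-channel convolution, and function names (`cnn_forward`, `conv2d_valid`) are assumptions for demonstration, not the authors' actual configuration.

```python
import numpy as np

def activation(name, x):
    """Apply the chosen non-linearity element-wise (ReLU or tanh)."""
    if name == "relu":
        return np.maximum(0.0, x)
    if name == "tanh":
        return np.tanh(x)
    raise ValueError(f"unknown activation: {name}")

def conv2d_valid(x, kernel):
    """Naive single-channel 'valid' 2-D convolution, for illustration only."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def cnn_forward(mfcc, depth, act="relu", seed=0):
    """Pass an MFCC matrix (n_mfcc x n_frames) through `depth` conv+activation stages.

    Weights are random placeholders; a real model would learn them by training.
    """
    rng = np.random.default_rng(seed)
    x = mfcc
    for _ in range(depth):
        kernel = rng.standard_normal((3, 3)) * 0.1  # hypothetical 3x3 filter
        x = activation(act, conv2d_valid(x, kernel))
    return x

# Example: a 13 x 40 MFCC-like matrix through a 3-layer ReLU stack.
# Each 3x3 'valid' convolution trims 2 rows and 2 columns.
mfcc = np.random.default_rng(1).standard_normal((13, 40))
out = cnn_forward(mfcc, depth=3, act="relu")
print(out.shape)  # → (7, 34)
```

Varying the `depth` argument between 2 and 4 mirrors the architectural comparison in the study, and swapping `act` between `"relu"` and `"tanh"` mirrors the activation-function comparison.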