Improving Conformer based End-To-End Manipuri Automatic Speech Recognition using Wav2vec2 model
Abstract
End-to-end speech recognition systems require large amounts of labeled speech data for training. Because large labeled speech corpora exist for high-resource languages such as English, this requirement tips the scales in favor of those languages. For most languages spoken around the world, by contrast, transcribed speech is scarce. This work builds a Conformer-based end-to-end automatic speech recognition (ASR) system for Manipuri, a low-resource Indian language. The ULCA (Unified Language Contribution APIs) Manipuri speech corpus of around 10 hours is used. The proposed method uses two approaches to feature extraction: log-Mel features and wav2vec2 speech features. Log-Mel features are extracted from the input speech signals, speed perturbation is used to effectively increase the amount of training data, and a wav2vec2 model is used as a pre-encoder for speech features. Word Error Rate (WER) and Character Error Rate (CER) are used to evaluate the trained Conformer-based model. The best performance achieved by the proposed system is 29.7% WER and 8.3% CER. Compared with the baseline LSTM-based ASR system, the proposed system gives an absolute improvement of 34% in WER and 23.5% in CER.
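For readers unfamiliar with the two evaluation metrics, the following is a minimal sketch of how WER and CER are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, taken over words for WER and over characters for CER. It is not the authors' evaluation code. The function names `edit_distance`, `wer`, and `cer` are illustrative. Whitespace tokenization, counting spaces as characters, and treating each Unicode code point as one character are assumptions that real toolkits may handle differently.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum number of substitutions,
    deletions, and insertions needed to turn ref into hyp."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate, with words split on whitespace (assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate over Unicode code points, spaces included
    (assumption; some toolkits strip spaces or use grapheme clusters)."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

As a usage example, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, giving about 0.333, or 33.3% WER. Corpus-level scores are usually obtained by summing edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.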