Diffusion Model-Based Prosody Prediction Techniques for Enhanced Speech Synthesis
Abstract
A prosody predictor based on a diffusion model is central to a new zero-shot approach to speech synthesis. Diffusion models excel at capturing complicated distributions, which makes them well suited to modeling the complex prosody patterns of speech, and they have recently attracted interest across a range of generative tasks. A diffusion model operates by gradual refinement: it repeatedly transforms an initially noisy input into an output that closely matches the intended target. In the context of prosody prediction, the model iteratively refines an initial rough estimate of the prosody pattern. Producing natural-sounding speech requires capturing small prosodic fluctuations in pitch, duration, and loudness, and this iterative approach enables the model to do exactly that. Trained on large speech corpora, the diffusion model-based prosody predictor learns to generate prosody patterns that mimic reference speech. During inference, the model uses the learned prosody patterns to predict the target speech's prosody, ensuring that the synthesized speech remains expressive and natural even for unseen speakers.
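The iterative refinement described above can be sketched as a DDPM-style reverse process over a per-phoneme prosody sequence. This is a minimal illustrative sketch, not the paper's implementation: the trained network is stubbed with a toy `predict_noise` function, and names such as `sample_prosody` and the (8 phonemes × 3 features) layout are hypothetical.

```python
import numpy as np

# Illustrative DDPM-style reverse process refining a prosody sequence
# (rows: phonemes; columns: pitch, duration, energy). The neural noise
# predictor is replaced by a toy oracle for demonstration purposes.

T = 50                                  # number of diffusion steps
betas = np.linspace(1e-4, 0.05, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)

def predict_noise(x, t, target):
    """Stand-in for a trained network: returns the noise that would have
    been added if `target` were the clean prosody pattern at step t."""
    return (x - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

def sample_prosody(target, n_phonemes=8, n_feats=3):
    """Start from Gaussian noise and iteratively denoise it toward a
    prosody pattern, following the standard DDPM update rule."""
    x = rng.standard_normal((n_phonemes, n_feats))
    for t in reversed(range(T)):
        eps = predict_noise(x, t, target)
        # posterior mean of the reverse step
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                        # no noise on the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

reference = np.tile([1.0, 0.5, -0.3], (8, 1))   # toy "reference prosody"
prosody = sample_prosody(reference)
```

With the oracle noise predictor, the chain converges to the reference pattern, which mirrors the abstract's point: the model starts from a chaotic input and progressively refines it until it matches the learned prosody target.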