Hyun-Woo Bae, Hyung-Seok Oh, Seung-Bin Kim and Seong-Whan Lee
With recent globalization trends, the importance of language education for nonnative speakers has significantly increased. Research on pronunciation correction and speech editing has therefore been actively pursued to reduce the pronunciation gap between native and nonnative speakers. Traditional speech editing models frequently suffer from unnaturalness at the boundaries of the corrected segments and demand substantial input information to synthesize corrected speech effectively. Furthermore, the scarcity of curated annotations of mispronounced speech hinders training. This paper introduces UnitCorrect, a mispronunciation correction model that leverages self-supervised unit representations to synthesize natural-sounding speech and to exploit phoneme information in speech more effectively. We also propose a dynamic time warping (DTW)-based mispronunciation detection system with reduced input requirements, which automatically identifies mispronounced segments by comparing the input audio with the target text. Moreover, we supplement the insufficient text information by incorporating frame-level text conditioning into the decoder input, thereby enabling the synthesis of high-quality speech with a conditional flow-matching decoder. Experimental results demonstrate that UnitCorrect outperforms existing models in both mispronunciation correction and naturalness of speech.
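To make the detection idea above concrete, the following is a minimal illustrative sketch (not the paper's implementation) of dynamic time warping: per-frame speech features are aligned against an expected target sequence by dynamic programming, and frames whose aligned local cost is high can then be flagged as potentially mispronounced. The sequences, the scalar "features", and the distance function here are all hypothetical placeholders.

```python
def dtw_align(frames, targets, dist):
    """Return the total DTW cost and the frame-to-target alignment path."""
    n, m = len(frames), len(targets)
    INF = float("inf")
    # cost[i][j] = minimal cumulative cost aligning frames[:i] with targets[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(frames[i - 1], targets[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # frame repeats a target
                                 cost[i][j - 1],      # target skipped
                                 cost[i - 1][j - 1])  # one-to-one match
    # backtrack to recover which frame aligned to which target position
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path

# Toy example with scalar stand-ins for frame features and target units:
# the repeated final frame aligns to the last target at zero extra cost.
total, path = dtw_align([1, 2, 3, 3], [1, 2, 3], lambda a, b: abs(a - b))
# total → 0, path → [(0, 0), (1, 1), (2, 2), (3, 2)]
```

In a real system the frames would be continuous feature vectors (e.g., phoneme posteriors) and the targets a phonemized transcript, but the alignment recursion is the same.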