StyleVC: Non-parallel Voice Conversion with Adversarial Style Generalization
In-Sun Hwang, Sang-Hoon Lee, Seong-Whan Lee
Abstract
Voice conversion (VC) transforms the voice of an utterance while preserving its linguistic content. It synthesizes speech from two samples: a source sample, which provides the content representation, and a target sample, which provides the style representation. Accordingly, VC research has focused on designing information flows that disentangle content and style in speech. However, the separated representations are degraded when passed through a sparse subspace. In addition, VC models suffer from a training-inference mismatch: they use only a single sample as both source and target during training. As a result, the model extracts inappropriate content and style representations and generates low-quality speech at inference time. To address this mismatch, we propose StyleVC, which employs adversarial style generalization. First, we propose a style generalization method that captures a global style representation and prevents the model from simply copying information. Second, we adopt a pitch predictor that estimates pitch information from the content and style representations. Third, we apply adversarial training so that the model generates more realistic speech. Finally, we demonstrate that the proposed model generates high-quality speech. Experimental results show that StyleVC significantly outperforms existing methods in extracting the desired representations and improving audio quality during inference.
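To make the role of the pitch predictor concrete, the following is a minimal, hypothetical PyTorch sketch of a module that estimates a frame-level pitch contour from content and style representations. The layer widths, the convolutional stack, and the concatenation of a global style embedding with frame-level content features are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class PitchPredictor(nn.Module):
    """Illustrative pitch predictor (assumed architecture, not the paper's exact design):
    predicts one pitch value per frame from concatenated content and style features."""

    def __init__(self, content_dim=256, style_dim=128, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(content_dim + style_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, 1, kernel_size=1),  # one pitch value per frame
        )

    def forward(self, content, style):
        # content: (batch, frames, content_dim) frame-level content representation
        # style:   (batch, style_dim) global style representation
        style_expanded = style.unsqueeze(1).expand(-1, content.size(1), -1)
        x = torch.cat([content, style_expanded], dim=-1).transpose(1, 2)
        return self.net(x).squeeze(1)  # (batch, frames) predicted pitch contour


# Usage sketch: during training, the predicted contour would be regressed
# against a ground-truth F0 contour (e.g., with an L1 or L2 loss).
content = torch.randn(2, 100, 256)  # dummy content features
style = torch.randn(2, 128)         # dummy global style embedding
pitch = PitchPredictor()(content, style)
print(pitch.shape)  # torch.Size([2, 100])
```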