StyleVC: Non-parallel Voice Conversion with Adversarial Style Generalization


In-Sun Hwang, Sang-Hoon Lee, Seong-Whan Lee

Abstract



Voice conversion (VC) changes the voice of an utterance while preserving its linguistic content. It uses two samples to synthesize speech: the source sample provides the content, and the target sample provides the style representation. Accordingly, VC research has focused on designing the information flow so that content and style in speech are disentangled. However, the separated representations are damaged while passing through a sparse subspace. In addition, VC models suffer from a training-inference mismatch: they use only one sample during training. As a result, the model extracts inappropriate content and style representations and generates low-quality speech during inference. To address this mismatch, we propose StyleVC, which utilizes adversarial style generalization. First, we propose style generalization, which captures a global style representation and restricts the model from copying information. Second, we use a pitch predictor to estimate pitch information from the content and style representations. Third, we use adversarial training to make the model generate more realistic speech. Finally, we demonstrate that the proposed model can generate high-quality speech. The experimental results also show that StyleVC significantly outperforms the baseline models in extracting the desired features and improving audio quality during inference.

The code will be released at https://github.com/intory89/StyleVC
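
The content/style split and the training-inference mismatch described in the abstract can be illustrated with a toy sketch. This is not the StyleVC implementation: the function names (`style_encoder`, `content_encoder`, `decoder`, `convert`) and the choice of a time-averaged vector as "global style" are assumptions made purely for illustration.

```python
from statistics import fmean

# Toy illustration only; these encoders are hypothetical stand-ins,
# not the StyleVC networks.

def style_encoder(frames):
    """Toy global style: the time-average of each feature dimension."""
    dims = len(frames[0])
    return [fmean(f[d] for f in frames) for d in range(dims)]

def content_encoder(frames):
    """Toy content: each frame with the time-averaged style removed."""
    style = style_encoder(frames)
    return [[x - s for x, s in zip(f, style)] for f in frames]

def decoder(content, style):
    """Toy decoder: add the style vector back onto every content frame."""
    return [[c + s for c, s in zip(f, style)] for f in content]

def convert(source, target):
    """Content comes from the source sample, style from the target sample."""
    return decoder(content_encoder(source), style_encoder(target))

source = [[1.0, 2.0], [3.0, 4.0]]
target = [[10.0, 20.0], [30.0, 40.0], [20.0, 30.0]]

# Training setting from the abstract: one sample plays both roles,
# so the task reduces to reconstruction.
reconstructed = convert(source, source)

# Inference setting: content and style come from two different samples.
converted = convert(source, target)
```

In this sketch the single-sample case reproduces the input exactly, so nothing in that setting forces the style path to carry only style. That is the kind of copying the abstract says style generalization is meant to restrict.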

Seen Demo (Non-parallel data)



Female to Male conversion

Source speech
Target speech
AutoVC
AGAIN-VC
VQVC+
StyleVC(Ours)


Female to Female conversion

Source speech
Target speech
AutoVC
AGAIN-VC
VQVC+
StyleVC(Ours)


Male to Female conversion

Source speech
Target speech
AutoVC
AGAIN-VC
VQVC+
StyleVC(Ours)


Male to Male conversion

Source speech
Target speech
AutoVC
AGAIN-VC
VQVC+
StyleVC(Ours)

Unseen Demo (Non-parallel data)




Female to Male conversion

Source speech
Target speech
AutoVC
AGAIN-VC
VQVC+
StyleVC(Ours)


Female to Female conversion

Source speech
Target speech
AutoVC
AGAIN-VC
VQVC+
StyleVC(Ours)


Male to Female conversion

Source speech
Target speech
AutoVC
AGAIN-VC
VQVC+
StyleVC(Ours)


Male to Male conversion

Source speech
Target speech
AutoVC
AGAIN-VC
VQVC+
StyleVC(Ours)