DCVC Demo

ABSTRACT

Several voice conversion (VC) methods using a simple autoencoder with a carefully designed information bottleneck have recently been studied. In general, they extract content information from the source speech through an information bottleneck between the encoder and the decoder, then pass it to the decoder along with the target speaker information to generate the converted speech. However, their performance depends heavily on the downsampling factor of the information bottleneck. In addition, such frame-by-frame conversion methods cannot convert aspects of speaking style that are tied to utterance length, such as duration. In this paper, we propose a novel duration controllable voice conversion (DCVC) model, which can transfer the speaking style and control the speed of the converted speech through a phoneme-based information bottleneck. The proposed information bottleneck removes the need to search for an appropriate downsampling factor and achieves better audio quality and VC performance. In our experiments, DCVC outperformed the baseline models with a MOS of 3.78 and a similarity score of 3.83. It can also smoothly control the speech duration while achieving a 39.35x inference speedup over a Seq2seq-based VC model.
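
To make the idea of duration control through a phoneme-level representation concrete, here is a minimal sketch. It is not the DCVC implementation. It assumes that content is held as one feature vector per phoneme together with a per-phoneme frame count, and that speed is changed by rescaling those counts before expanding back to frames. The function names scale_durations and length_regulate, the speed parameter, and the toy feature values are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def scale_durations(durations: Sequence[int], speed: float) -> List[int]:
    """Rescale per-phoneme frame counts (assumed representation).

    speed > 1 shortens the utterance, speed < 1 lengthens it.
    Every phoneme keeps at least one frame so none is dropped.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [max(1, round(d / speed)) for d in durations]


def length_regulate(
    phoneme_feats: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Expand phoneme-level features to frame level by repetition."""
    if len(phoneme_feats) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for feat, d in zip(phoneme_feats, durations):
        # Copy the vector for each frame so frames do not share storage.
        frames.extend(list(feat) for _ in range(d))
    return frames


# Toy input: three phonemes with 2-dimensional features (made-up values).
phoneme_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [4, 2, 6]  # frames per phoneme at normal speed

normal = length_regulate(phoneme_feats, scale_durations(durations, 1.0))
fast = length_regulate(phoneme_feats, scale_durations(durations, 2.0))
slow = length_regulate(phoneme_feats, scale_durations(durations, 0.5))

print(len(normal), len(fast), len(slow))
```

In this sketch the output length is set entirely by the duration list, so no downsampling factor has to be tuned. A frame-by-frame bottleneck, by contrast, produces an output whose length is fixed by the input length.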

TRADITIONAL VC

Source Speaker	Target Speaker	Converted
p227 (Male)	p232 (Male)	StarGAN-VC
		AutoVC
		Seq2seq-VC
		DCVC
	p303 (Female)	StarGAN-VC
		AutoVC
		Seq2seq-VC
		DCVC
p303 (Female)	p227 (Male)	StarGAN-VC
		AutoVC
		Seq2seq-VC
		DCVC
	p228 (Female)	StarGAN-VC
		AutoVC
		Seq2seq-VC
		DCVC

ZERO-SHOT VC

SEEN-TO-UNSEEN

Source Speaker	Target Speaker	Converted
p232 (Male)	p246 (Male)	AutoVC
		Seq2seq-VC
		DCVC
	p335 (Female)	AutoVC
		Seq2seq-VC
		DCVC
p228 (Female)	p246 (Male)	AutoVC
		Seq2seq-VC
		DCVC
	p335 (Female)	AutoVC
		Seq2seq-VC
		DCVC

UNSEEN-TO-SEEN

Source Speaker	Target Speaker	Converted
p246 (Male)	p232 (Male)	AutoVC
		Seq2seq-VC
		DCVC
	p228 (Female)	AutoVC
		Seq2seq-VC
		DCVC
p335 (Female)	p232 (Male)	AutoVC
		Seq2seq-VC
		DCVC
	p228 (Female)	AutoVC
		Seq2seq-VC
		DCVC

UNSEEN-TO-UNSEEN

Source Speaker	Target Speaker	Converted
p245 (Male)	p246 (Male)	AutoVC
		Seq2seq-VC
		DCVC
	p335 (Female)	AutoVC
		Seq2seq-VC
		DCVC
p261 (Female)	p246 (Male)	AutoVC
		Seq2seq-VC
		DCVC
	p335 (Female)	AutoVC
		Seq2seq-VC
		DCVC

DURATION CONTROL

Source Speaker	Target Speaker	Converted (DCVC)
p227 (Male)	p232 (Male)	VERY SLOW
		SLOW
		NORMAL
		FAST
		VERY FAST
	p303 (Female)	VERY SLOW
		SLOW
		NORMAL
		FAST
		VERY FAST
p303 (Female)	p227 (Male)	VERY SLOW
		SLOW
		NORMAL
		FAST
		VERY FAST
	p228 (Female)	VERY SLOW
		SLOW
		NORMAL
		FAST
		VERY FAST