Duration Controllable Voice Conversion via Phoneme-Based Information Bottleneck

Sang-Hoon Lee, Hyeong-Rae Noh, Woo-Jeoung Nam, and Seong-Whan Lee, Fellow, IEEE

ABSTRACT

Several voice conversion (VC) methods that use a simple autoencoder with a carefully designed information bottleneck have recently been studied. In general, they extract content information from a given utterance through the information bottleneck between the encoder and the decoder, and provide it to the decoder along with the target speaker information to generate the converted speech. However, their performance depends heavily on the downsampling factor of the information bottleneck. In addition, such frame-by-frame conversion methods cannot convert speaking styles tied to utterance length, such as duration. In this paper, we propose a novel duration controllable voice conversion (DCVC) model, which can transfer the speaking style and control the speed of the converted speech through a phoneme-based information bottleneck. The proposed information bottleneck removes the need to search for an appropriate downsampling factor and achieves better audio quality and VC performance. In our experiments, DCVC outperformed the baseline models with a MOS of 3.78 and a similarity score of 3.83. It can also smoothly control the speech duration while achieving a 39.35x speedup in inference over a Seq2seq-based VC.
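To make the idea of a phoneme-based bottleneck with speed control more concrete, the following is a minimal sketch and not the DCVC implementation. It assumes that per-phoneme frame counts are already available (for example from a forced aligner), that the bottleneck can be approximated by averaging frame-level features within each phoneme, and that speed is changed by rescaling those frame counts before expanding back to frame level. The function names `phoneme_pool`, `scale_durations`, and `length_regulate` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Frame = List[float]


def phoneme_pool(frames: Sequence[Frame], durations: Sequence[int]) -> List[Frame]:
    """Average frame-level features inside each phoneme segment.

    Assumption: durations[i] is the number of frames of phoneme i, so the
    pooled sequence has one vector per phoneme and no downsampling factor
    has to be chosen by hand.
    """
    if sum(durations) != len(frames):
        raise ValueError("durations must sum to the number of frames")
    pooled: List[Frame] = []
    start = 0
    for d in durations:
        if d <= 0:
            raise ValueError("each phoneme needs at least one frame")
        segment = frames[start:start + d]
        dim = len(segment[0])
        pooled.append([sum(f[i] for f in segment) / d for i in range(dim)])
        start += d
    return pooled


def scale_durations(durations: Sequence[int], speed: float) -> List[int]:
    """Rescale frame counts: speed > 1 shortens, speed < 1 lengthens.

    Every phoneme keeps at least one frame. Python's round() rounds
    halves to the nearest even integer.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [max(1, round(d / speed)) for d in durations]


def length_regulate(pooled: Sequence[Frame], durations: Sequence[int]) -> List[Frame]:
    """Repeat each phoneme-level vector for its (possibly rescaled) frame count."""
    expanded: List[Frame] = []
    for vec, d in zip(pooled, durations):
        expanded.extend(list(vec) for _ in range(d))
    return expanded


# Toy example: six 1-dimensional frames covering two phonemes (2 and 4 frames).
frames = [[0.0], [2.0], [4.0], [6.0], [8.0], [10.0]]
durations = [2, 4]

pooled = phoneme_pool(frames, durations)
fast = length_regulate(pooled, scale_durations(durations, speed=2.0))
slow = length_regulate(pooled, scale_durations(durations, speed=0.5))
print(len(frames), len(fast), len(slow))
```

In this toy setting the pooled sequence length equals the number of phonemes regardless of the frame rate, and the output length follows the chosen speed factor. A real system would feed the expanded sequence, together with target speaker information, to a decoder.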

TRADITIONAL VC

Source Speaker | Target Speaker | Converted
p227 (Male) | p232 (Male) | StarGAN-VC, AutoVC, Seq2seq-VC, DCVC
p227 (Male) | p303 (Female) | StarGAN-VC, AutoVC, Seq2seq-VC, DCVC
p303 (Female) | p227 (Male) | StarGAN-VC, AutoVC, Seq2seq-VC, DCVC
p303 (Female) | p228 (Female) | StarGAN-VC, AutoVC, Seq2seq-VC, DCVC

ZERO-SHOT VC

SEEN-TO-UNSEEN

Source Speaker | Target Speaker | Converted
p232 (Male) | p246 (Male) | AutoVC, Seq2seq-VC, DCVC
p232 (Male) | p335 (Female) | AutoVC, Seq2seq-VC, DCVC
p228 (Female) | p246 (Male) | AutoVC, Seq2seq-VC, DCVC
p228 (Female) | p335 (Female) | AutoVC, Seq2seq-VC, DCVC

UNSEEN-TO-SEEN

Source Speaker | Target Speaker | Converted
p246 (Male) | p232 (Male) | AutoVC, Seq2seq-VC, DCVC
p246 (Male) | p228 (Female) | AutoVC, Seq2seq-VC, DCVC
p335 (Female) | p232 (Male) | AutoVC, Seq2seq-VC, DCVC
p335 (Female) | p228 (Female) | AutoVC, Seq2seq-VC, DCVC

UNSEEN-TO-UNSEEN

Source Speaker | Target Speaker | Converted
p245 (Male) | p246 (Male) | AutoVC, Seq2seq-VC, DCVC
p245 (Male) | p335 (Female) | AutoVC, Seq2seq-VC, DCVC
p261 (Female) | p246 (Male) | AutoVC, Seq2seq-VC, DCVC
p261 (Female) | p335 (Female) | AutoVC, Seq2seq-VC, DCVC

DURATION CONTROL

Columns: Source Speaker, Target Speaker, Converted (AutoVC), Converted (DCVC)

Source Speaker | Target Speaker | Speed settings
p227 (Male) | p232 (Male) | VERY SLOW, SLOW, NORMAL, FAST, VERY FAST
p227 (Male) | p303 (Female) | VERY SLOW, SLOW, NORMAL, FAST, VERY FAST
p303 (Female) | p227 (Male) | VERY SLOW, SLOW, NORMAL, FAST, VERY FAST
p303 (Female) | p228 (Female) | VERY SLOW, SLOW, NORMAL, FAST, VERY FAST