DurFlex-EVC: Duration-Flexible Emotional Voice Conversion Leveraging Discrete Representations without Text Alignment

Hyung-Seok Oh, Sang-Hoon Lee, Deok-Hyun Cho and Seong-Whan Lee

Abstract

Emotional voice conversion (EVC) involves modifying various acoustic characteristics, such as pitch and spectral envelope, to match a desired emotional state while preserving the speaker's identity. Existing EVC methods often rely on text transcriptions or time-alignment information and struggle to handle varying speech durations effectively. In this paper, we propose DurFlex-EVC, a duration-flexible EVC framework that operates without the need for text or alignment information. We introduce a unit aligner that models contextual information by aligning speech with discrete units representing content, eliminating the need for text or speech-text alignment. Additionally, we design a style autoencoder that effectively disentangles content and emotional style, allowing precise manipulation of the emotional characteristics of the speech. We further enhance emotional expressiveness through a hierarchical stylize encoder that applies the target emotional style at multiple hierarchical levels, refining the stylization process to improve the naturalness and expressiveness of the converted speech. Experimental results from subjective and objective evaluations demonstrate that our approach outperforms baseline models, effectively handling duration variability and enhancing emotional expressiveness in the converted speech.
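
The abstract outlines the overall pipeline: a unit aligner that ties speech features to discrete content units, a style autoencoder that strips the source emotion and injects the target one, a duration model that lets the output length differ from the input, and a decoder that renders the converted speech. The PyTorch snippet below is a minimal, hypothetical sketch of that data flow; module names, dimensions, and wiring are illustrative assumptions rather than the authors' implementation, which (per the ablation list below) uses a stochastic duration predictor, a hierarchical stylize encoder, and a diffusion-based decoder.

```python
# Minimal, hypothetical sketch of a DurFlex-EVC-style pipeline (illustrative only).
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence


class UnitAligner(nn.Module):
    """Aligns frame-level speech features with discrete content units via cross-attention."""

    def __init__(self, dim: int, num_units: int):
        super().__init__()
        self.unit_emb = nn.Embedding(num_units, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feats: torch.Tensor, units: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) speech features; units: (B, U) discrete unit ids
        queries = self.unit_emb(units)
        content, _ = self.attn(queries, feats, feats)  # unit-level content representation
        return content


class StyleAutoencoder(nn.Module):
    """Strips style statistics from content features, then re-applies a target style."""

    def __init__(self, dim: int, num_styles: int):
        super().__init__()
        self.style_emb = nn.Embedding(num_styles, dim)
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)  # removes source style
        self.proj = nn.Linear(dim, 2 * dim)                      # predicts scale and shift

    def forward(self, content: torch.Tensor, style_id: torch.Tensor) -> torch.Tensor:
        scale, shift = self.proj(self.style_emb(style_id)).chunk(2, dim=-1)
        return self.norm(content) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class DurFlexSketch(nn.Module):
    """Unit aligner -> style autoencoder -> duration-based upsampling -> decoder."""

    def __init__(self, dim: int = 256, num_units: int = 200, num_styles: int = 5, n_mels: int = 80):
        super().__init__()
        self.aligner = UnitAligner(dim, num_units)
        self.style_ae = StyleAutoencoder(dim, num_styles)
        self.dur_pred = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.decoder = nn.GRU(dim, dim, batch_first=True)  # stand-in for the diffusion decoder
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, feats: torch.Tensor, units: torch.Tensor, style_id: torch.Tensor) -> torch.Tensor:
        content = self.aligner(feats, units)            # (B, U, D)
        stylized = self.style_ae(content, style_id)     # (B, U, D)
        # Per-unit durations set the output length, giving duration flexibility.
        dur = self.dur_pred(stylized).squeeze(-1).exp().round().clamp(min=1).long()
        frames = [s.repeat_interleave(d, dim=0) for s, d in zip(stylized, dur)]
        hidden, _ = self.decoder(pad_sequence(frames, batch_first=True))
        return self.to_mel(hidden)                      # (B, T', n_mels)


# Example: convert one utterance (300 feature frames, 80 content units) to emotion id 2.
model = DurFlexSketch()
mel = model(torch.randn(1, 300, 256), torch.randint(0, 200, (1, 80)), torch.tensor([2]))
print(mel.shape)  # (1, T', 80), where T' follows the predicted durations
```

The point the sketch illustrates is that the output length is governed by predicted per-unit durations rather than by the input frame count, which is what makes the conversion duration-flexible without text or alignment supervision.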




Comparative Models

Ablation Study

DurFlex-EVC: proposed model
DurFlex-EVC (w/o SAE): without the style autoencoder (SAE)
DurFlex-EVC (w/o UA): without the unit aligner (UA)
DurFlex-EVC (w/o HSE): without the hierarchical stylize encoder (HSE)
DurFlex-EVC (w/ DDP): using a deterministic duration predictor (DDP) instead of the stochastic duration predictor (a sketch of such a predictor follows this list)
DurFlex-EVC (w/ FFT): using a feed-forward transformer (FFT)-based decoder instead of the diffusion model
DurFlex-EVC (w/ adv): using adversarial training for style disentanglement instead of the SAE
DurFlex-EVC (w/ unit2mel): generating the Mel-spectrogram directly from units
DurFlex-EVC (w/ unit2wav): generating the waveform directly from units
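
As a point of reference for the "w/ DDP" variant above, the following is a hedged sketch of a deterministic duration predictor of the kind commonly used in non-autoregressive speech synthesis; the layer sizes and MSE training target are generic assumptions, not the paper's specification, and the stochastic predictor it replaces would instead sample durations from a learned distribution.

```python
# Generic deterministic duration predictor (illustrative; not the authors' implementation).
import torch
import torch.nn as nn


class DeterministicDurationPredictor(nn.Module):
    """Predicts one log-duration per content unit with a small convolutional stack."""

    def __init__(self, dim: int = 256, kernel: int = 3, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2), nn.ReLU(), nn.Dropout(dropout),
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2), nn.ReLU(), nn.Dropout(dropout),
        )
        self.proj = nn.Conv1d(dim, 1, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, U, D) unit-level features -> (B, U) predicted log-durations
        return self.proj(self.net(x.transpose(1, 2))).squeeze(1)


# Training typically regresses per-unit log frame counts with an MSE loss.
pred = DeterministicDurationPredictor()(torch.randn(2, 80, 256))
target = torch.log(torch.randint(1, 10, (2, 80)).float())
loss = nn.functional.mse_loss(pred, target)
```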

Comparison of results based on input features

DurFlex-EVC (w/ Mel-spec.): using the Mel-spectrogram
DurFlex-EVC (w/ linear spec.): using the linear spectrogram
DurFlex-EVC (w/ wav2vec 2.0): using wav2vec 2.0 [6] representations
DurFlex-EVC (w/ WavLM): using WavLM [7] representations
DurFlex-EVC (w/ HuBERT): using HuBERT [8] representations (a feature-extraction sketch follows this list)
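
The sketch below shows one way the self-supervised input features compared above could be extracted; the Hugging Face checkpoint names are assumptions for illustration, since the experiment list does not specify them.

```python
# Extracting wav2vec 2.0 / WavLM / HuBERT features from raw 16 kHz audio.
import torch
from transformers import AutoModel

checkpoints = {               # checkpoint choices are illustrative assumptions
    "wav2vec 2.0": "facebook/wav2vec2-base",
    "WavLM": "microsoft/wavlm-base",
    "HuBERT": "facebook/hubert-base-ls960",
}

wav = torch.randn(1, 16000)   # placeholder for 1 s of 16 kHz audio (normalize real audio first)
for name, ckpt in checkpoints.items():
    model = AutoModel.from_pretrained(ckpt).eval()
    with torch.no_grad():
        feats = model(input_values=wav).last_hidden_state  # (1, frames, hidden_dim)
    print(name, feats.shape)

# For the spectrogram variants, torchaudio transforms would serve instead, e.g.
# torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)(wav) for the
# Mel-spectrogram and torchaudio.transforms.Spectrogram()(wav) for the linear spectrogram.
```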

Speaker encoder setting for zero-shot emotion conversion


References

[1] G. Rizos, A. Baird, M. Elliott, and B. Schuller, “StarGAN for Emotional Speech Conversion: Validated by Data Augmentation of End-to-End Emotion Recognition,” in IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 3502-3506.

[2] K. Zhou, B. Sisman, and H. Li, “Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-Stage Sequence-to-Sequence Training,” in Proc. Interspeech, 2021, pp. 811-815.

[3] K. Zhou, B. Sisman, R. Rana, B. W. Schuller, and H. Li, “Emotion Intensity and its Control for Emotional Voice Conversion,” IEEE Trans. Affect. Comput., vol. 14, no. 1, pp. 31-48, 2023.

[4] K. Zhou, B. Sisman, R. Rana, B. W. Schuller, and H. Li, “Speech Synthesis with Mixed Emotions,” IEEE Trans. Affect. Comput., pp. 1-16, 2022.

[5] F. Kreuk, A. Polyak, J. Copet, E. Kharitonov, T. A. Nguyen, M. Rivière, W.-N. Hsu, A. Mohamed, E. Dupoux, and Y. Adi, “Textless Speech Emotion Conversion using Discrete & Decomposed Representations,” in Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), Dec. 2022, pp. 11200-11214.

[6] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 12449-12460.

[7] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE J. Sel. Top. Signal Process., vol. 16, no. 6, pp. 1505-1518, 2022.

[8] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3451-3460, 2021.