Audio demos

ABSTRACT

Text-to-speech (TTS) synthesis is the process of producing synthesized speech from text or phoneme input. Traditional TTS models contain multiple processing steps and require external aligners for training. Because complexity increases and efficiency decreases with every additional step, there is growing demand in modern synthesis pipelines for end-to-end TTS with efficient internal aligners. In this work, we propose an end-to-end text-to-waveform network with a novel reinforcement-learning-based duration search method. Our proposed generator is feed-forward, and the aligner trains an agent to make optimal duration predictions: the agent receives feedback on the actions it takes and learns to maximize cumulative reward. We demonstrate that accurate phoneme-to-mel alignments generated by the trained agents enhance the fidelity and naturalness of synthesized audio. Experimental results also show that our proposed model outperforms other state-of-the-art TTS models with internal and external aligners.
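To make the idea of reward-driven duration search concrete, the following is a minimal toy sketch of a policy-gradient (REINFORCE-style) update for per-phoneme durations. It is not the model described above: the abstract does not specify the reward, the state, or the network, so the reward used here (negative distance to a reference duration), the function name `train_duration_policy`, and all hyperparameters are assumptions chosen purely for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_duration_policy(target_durations, max_duration=8,
                          steps=2000, lr=0.1, seed=0):
    """Toy REINFORCE loop: one categorical duration policy per phoneme.

    target_durations is a hypothetical reference (in frames) used only
    to define an illustrative reward; a real system would derive its
    reward from the synthesized output instead.
    """
    rng = random.Random(seed)
    n = len(target_durations)
    actions = list(range(max_duration))
    # logits[i][d] scores the choice "phoneme i lasts d + 1 frames".
    logits = [[0.0] * max_duration for _ in range(n)]
    baseline = [0.0] * n  # running mean reward, reduces gradient variance

    for _ in range(steps):
        for i in range(n):
            probs = softmax(logits[i])
            action = rng.choices(actions, weights=probs)[0]
            # Assumed reward: closer to the reference duration is better.
            reward = -abs((action + 1) - target_durations[i])
            advantage = reward - baseline[i]
            baseline[i] += 0.05 * (reward - baseline[i])
            # Gradient of log pi(action) w.r.t. logits is onehot - probs.
            for d in actions:
                grad = (1.0 if d == action else 0.0) - probs[d]
                logits[i][d] += lr * advantage * grad

    # Greedy duration per phoneme after training.
    return [max(actions, key=lambda d: logits[i][d]) + 1 for i in range(n)]


if __name__ == "__main__":
    print(train_duration_policy([2, 5, 3]))
```

The sketch treats each phoneme as an independent bandit problem, which keeps the update rule visible in a few lines. A full aligner would condition the policy on the text encoding and would score whole alignments rather than isolated durations.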

Audio samples

Script : He was a notorious criminal.
Ground Truth	Vocoded (HiFi-GAN)	Tacotron2
BVAE-TTS	Glow-TTS	Reinforce-Aligner (Proposed)

Script : In the field of preventive investigation in regard to the President's security.
Ground Truth	Vocoded (HiFi-GAN)	Tacotron2
BVAE-TTS	Glow-TTS	Reinforce-Aligner (Proposed)

Script : While Dr. Carrico went on to attend the President, Dr. Dulany stayed with the Governor and was soon joined by several other doctors.
Ground Truth	Vocoded (HiFi-GAN)	Tacotron2
BVAE-TTS	Glow-TTS	Reinforce-Aligner (Proposed)

Script : The Commission evaluated the physical evidence found near the window after the assassination and the testimony of eyewitnesses.
Ground Truth	Vocoded (HiFi-GAN)	Tacotron2
BVAE-TTS	Glow-TTS	Reinforce-Aligner (Proposed)

Script : The man in the lobby, quote, looked over his shoulder and turned around and walked up West Jefferson towards the theatre, end quote.
Ground Truth	Vocoded (HiFi-GAN)	Tacotron2
BVAE-TTS	Glow-TTS	Reinforce-Aligner (Proposed)

Ablation studies

Korean Multi-Speaker Dataset (NIKL Corpus)
Ground Truth	w/ Target Duration	w/o DTW	w/ DTW
Segment-wise 1	Segment-wise 2	Phoneme-wise 1	Phoneme-wise 2


Ground Truth	w/ Target Duration	w/o DTW	w/ DTW
Segment-wise 1	Segment-wise 2	Phoneme-wise 1	Phoneme-wise 2


Ground Truth	w/ Target Duration	w/o DTW	w/ DTW
Segment-wise 1	Segment-wise 2	Phoneme-wise 1	Phoneme-wise 2


Ground Truth	w/ Target Duration	w/o DTW	w/ DTW
Segment-wise 1	Segment-wise 2	Phoneme-wise 1	Phoneme-wise 2


Ground Truth	w/ Target Duration	w/o DTW	w/ DTW
Segment-wise 1	Segment-wise 2	Phoneme-wise 1	Phoneme-wise 2