Hyunseung Chung, Sang-Hoon Lee, Seong-Whan Lee
Text-to-speech (TTS) synthesis is the process of producing synthesized speech from text or phoneme input. Traditional TTS models contain multiple processing steps and require external aligners for training. Because each additional step increases complexity and reduces efficiency, there is growing demand in modern synthesis pipelines for end-to-end TTS with efficient internal aligners. In this work, we propose an end-to-end text-to-waveform network with a novel reinforcement-learning-based duration search method. The proposed generator is feed-forward, and the aligner trains an agent to make optimal duration predictions: the agent receives feedback on the actions it takes and learns to maximize cumulative reward. We demonstrate that accurate phoneme-to-mel alignments generated by trained agents enhance the fidelity and naturalness of synthesized audio. Experimental results also show that the proposed model outperforms other state-of-the-art TTS models with internal and external aligners.
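To make the idea of a reward-driven duration search concrete, here is a minimal toy sketch in Python. It is not the authors' agent, reward definition, or network: it assumes scalar phoneme features, a frame-level target sequence, and a simple greedy rule that accepts a boundary shift only when the reward (the reduction in frame error) is positive. All function names and values (`length_regulate`, `frame_error`, `search_durations`, the example features) are hypothetical and chosen for illustration only.

```python
def length_regulate(phoneme_feats, durations):
    """Repeat each phoneme-level feature for its duration (in frames)."""
    frames = []
    for feat, dur in zip(phoneme_feats, durations):
        frames.extend([feat] * dur)
    return frames


def frame_error(frames, target):
    """Sum of absolute differences; both sequences have equal length here."""
    return sum(abs(a - b) for a, b in zip(frames, target))


def search_durations(phoneme_feats, target, max_sweeps=100):
    """Greedy boundary-shift search (toy stand-in for a learned agent).

    Assumes len(target) >= len(phoneme_feats) so every phoneme gets >= 1 frame.
    """
    n = len(phoneme_feats)
    # Start from (near-)uniform durations that sum to the target length.
    base, extra = divmod(len(target), n)
    durations = [base + (1 if i < extra else 0) for i in range(n)]
    error = frame_error(length_regulate(phoneme_feats, durations), target)
    total_reward = 0.0

    for _ in range(max_sweeps):
        improved = False
        for i in range(n - 1):
            for shift in (-1, 1):
                # Action: move one frame across the boundary between i and i+1.
                cand = list(durations)
                cand[i] += shift
                cand[i + 1] -= shift
                if min(cand[i], cand[i + 1]) < 1:
                    continue  # every phoneme keeps at least one frame
                cand_error = frame_error(
                    length_regulate(phoneme_feats, cand), target
                )
                reward = error - cand_error  # positive if alignment improved
                if reward > 0:
                    durations, error = cand, cand_error
                    total_reward += reward
                    improved = True
        if not improved:
            break  # no action yields positive reward
    return durations, error, total_reward


# Toy usage: recover hidden durations from a synthetic frame-level target.
feats = [1.0, 2.0, 3.0]
true_durations = [2, 4, 3]
target = length_regulate(feats, true_durations)
durations, error, total_reward = search_durations(feats, target)
```

Shifting a boundary rather than changing a single duration keeps the total number of frames fixed, so the expanded sequence can always be compared frame by frame with the target. A learned agent as described in the abstract would replace the greedy acceptance rule with a trained policy; this sketch only illustrates the action and reward loop.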
Script: He was a notorious criminal.
Audio samples: Ground Truth | Vocoded (HiFi-GAN) | Tacotron2 | BVAE-TTS | Glow-TTS | Reinforce-Aligner (Proposed)

Script: In the field of preventive investigation in regard to the President's security.
Audio samples: Ground Truth | Vocoded (HiFi-GAN) | Tacotron2 | BVAE-TTS | Glow-TTS | Reinforce-Aligner (Proposed)

Script: While Dr. Carrico went on to attend the President, Dr. Dulany stayed with the Governor and was soon joined by several other doctors.
Audio samples: Ground Truth | Vocoded (HiFi-GAN) | Tacotron2 | BVAE-TTS | Glow-TTS | Reinforce-Aligner (Proposed)

Script: The Commission evaluated the physical evidence found near the window after the assassination and the testimony of eyewitnesses.
Audio samples: Ground Truth | Vocoded (HiFi-GAN) | Tacotron2 | BVAE-TTS | Glow-TTS | Reinforce-Aligner (Proposed)

Script: The man in the lobby, quote, looked over his shoulder and turned around and walked up West Jefferson towards the theatre, end quote.
Audio samples: Ground Truth | Vocoded (HiFi-GAN) | Tacotron2 | BVAE-TTS | Glow-TTS | Reinforce-Aligner (Proposed)

Korean Multi-Speaker Dataset (NIKL Corpus)
Audio samples (five sets, each with the same conditions): Ground Truth | w/ Target Duration | w/o DTW | w/ DTW | Segment-wise 1 | Segment-wise 2 | Phoneme-wise 1 | Phoneme-wise 2