EmoQ-TTS: Emotion intensity Quantization for Fine-grained Controllable Emotional Text-to-Speech

Chae-Bin Im, Sang-Hoon Lee, Seung-Bin Kim, Seong-Whan Lee

ABSTRACT

Although text-to-speech (TTS) has improved significantly in recent years, emotional speech synthesis remains limited. To produce emotional speech, most works use emotion information extracted from emotion labels or reference audio. However, because these emotion conditions are applied at the utterance level, the resulting emotional expression is monotonous. In this paper, we propose EmoQ-TTS, which synthesizes expressive emotional speech by conditioning on phoneme-wise emotion information with fine-grained emotion intensity. The intensity of the emotion information is obtained by distance-based intensity quantization, without human labeling. The emotional expression of the synthesized speech can also be controlled by setting the intensity labels manually. The experimental results demonstrate the superiority of EmoQ-TTS in emotional expressiveness and controllability.
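
To make the idea of distance-based intensity quantization concrete, the following is a minimal sketch, not the authors' implementation. It assumes that intensity is measured as the L1 distance between a phoneme-level emotion embedding and the centroid of neutral embeddings, and that the distances are split into equal-population bins. The function names, the choice of neutral centroid, the L1 metric, and the number of levels are all hypothetical choices made for illustration; the page itself does not specify them.

```python
from bisect import bisect_right
from statistics import fmean


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def l1_distance(a, b):
    """Sum of absolute component differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def quantize_intensity(emotion_vecs, neutral_vecs, num_levels=3):
    """Assign each embedding an intensity label in 0 .. num_levels - 1.

    Assumption (not from the source): a larger distance from the
    neutral centroid means a stronger emotion.
    """
    anchor = centroid(neutral_vecs)
    dists = [l1_distance(v, anchor) for v in emotion_vecs]
    ranked = sorted(dists)
    # Bin edges sit at equal-population cut points of the sorted distances.
    edges = [ranked[len(ranked) * k // num_levels] for k in range(1, num_levels)]
    # bisect_right counts how many edges each distance has reached.
    return [bisect_right(edges, d) for d in dists]


if __name__ == "__main__":
    neutral = [[0.0, 0.0], [0.0, 0.0]]
    emotional = [[1, 0], [2, 0], [3, 0], [4, 0], [5, 0], [6, 0]]
    print(quantize_intensity(emotional, neutral))
```

Because the labels come only from distances in the embedding space, no human intensity annotation is needed, which is the property the abstract emphasizes.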

Emotion Control

Script : 그러나 이제는 이런 단순대결구도에만 집착해서는 안됩니다.
Translation : But now we shouldn't stick to this simple confrontation structure.
Pronunciation : Geuleona ijeneun ileon dansundaegyeolgudo-eman jibchaghaeseoneun andoebnida.
Emotion : selectable
Intensity : 0 to 1



Single-speaker TTS

All audio samples use FreGAN as the vocoder.

Script                      :   기쁘게 일하고 해 놓은 일을 기뻐하는 사람은 행복하다.
Translation        :   Those who work happily and are happy with what they have done are happy.
Pronunciation :   Gippeuge ilhago hae noh-eun il-eul gippeohaneun salam-eun haengboghada.
Emotion               :   Happy

Ground Truth

Vocoded

TP-GST

FEP

EmoQ-TTS

Script                      :   그 점을 염두에 두고 국교정상화를 받아들인다면 성급한 생각이다.
Translation        :   It is a hasty idea to accept the normalization of diplomatic relations with that in mind.
Pronunciation :   Geu jeom-eul yeomdue dugo guggyojeongsanghwaleul bad-adeul-indamyeon seong-geubhan saeng-gag-ida.
Emotion               :   Sad

Ground Truth

Vocoded

TP-GST

FEP

EmoQ-TTS

Script                      :   그렇게 항상 음식을 남기다 벌 받으면 어쩌려구.
Translation        :   What if you get punished for leaving food like that?
Pronunciation :   Geuleohge hangsang eumsig-eul namgida beol bad-eumyeon eojjeolyeogu.
Emotion               :   Angry

Ground Truth

Vocoded

TP-GST

FEP

EmoQ-TTS

Script                      :   돈을 받고 피를 팔던 매혈자가 국립보건원 검사결과 후천성면역결핍증 감염자로 밝혀졌다.
Translation        :   A person who had been selling blood for money was found to be infected with acquired immunodeficiency syndrome (AIDS), according to test results from the National Institute of Health.
Pronunciation :   Don-eul badgo pileul paldeon maehyeoljaga guglibbogeon-won geomsagyeolgwa hucheonseongmyeon-yeoggyeolpibjeung gam-yeomjalo balghyeojyeossda.
Emotion               :   Surprised

Ground Truth

Vocoded

TP-GST

FEP

EmoQ-TTS

Script                      :   기대를 가지고 파칭코에 갔지만 돈을 하나도 따지 못했다.
Translation        :   I went to the pachinko parlor full of anticipation, but I didn't win any money.
Pronunciation :   Gidaeleul gajigo pachingko-e gassjiman don-eul hanado ttaji moshaessda.
Emotion               :   Fear

Ground Truth

Vocoded

TP-GST

FEP

EmoQ-TTS

Script                      :   친구에게 큰 돈을 빌려달라고 부탁했는데 고맙게도 선뜻 빌려주었다.
Translation        :   I asked my friend to lend me a large sum of money, and thankfully he readily lent it to me.
Pronunciation :   Chinguege keun don-eul billyeodallago butaghaessneunde gomabgedo seontteus billyeojueossda.
Emotion               :   Disgusted

Ground Truth

Vocoded

TP-GST

FEP

EmoQ-TTS

Multi-speaker TTS

All audio samples use FreGAN as the vocoder.

Script                      :   엄청나게 배고팠는데 이제야 살 것 같아.
Translation        :   I was starving, but now I finally feel alive again.
Pronunciation :   Eomcheongnage baegopassneunde ijeya sal geos gat-a.
Speaker               :   Female (nec)
Emotion               :   Happy

Ground Truth

Vocoded

TP-GST

FEP

EmoQ-TTS

Script                      :   흥겨워서 노래가 절로 나는구먼!
Translation        :   I can't help but sing because I'm so excited.
Pronunciation :   Heung-gyeowoseo nolaega jeollo naneungumeon!
Speaker               :   Male (nel)
Emotion               :   Happy

Ground Truth

Vocoded

TP-GST

FEP

EmoQ-TTS

Script                      :   신축성이 강한 긴 장대를 쥐고 도움닫기를 하여 가로대를 넘는 종목이다.
Translation        :   It is an event in which the athlete takes a run-up holding a long, highly elastic pole and clears the crossbar.
Pronunciation :   Sinchugseong-i ganghan gin jangdaeleul jwigo doumdadgileul hayeo galodaeleul neomneun jongmog-ida.
Speaker               :   Female (emb)
Emotion               :   Sad

Ground Truth

Vocoded

TP-GST

FEP

EmoQ-TTS

Script                      :   주로 물이 차가운 곳에 서식하며 밤에는 작은 물고기나 수생곤충 등을 잡아먹습니다.
Translation        :   It mainly lives in places where the water is cold and eats small fish and aquatic insects at night.
Pronunciation :   Julo mul-i chagaun gos-e seosighamyeo bam-eneun jag-eun mulgogina susaeng-gonchung deung-eul jab-ameogseubnida.
Speaker               :   Male (emf)
Emotion               :   Sad

Ground Truth

Vocoded

TP-GST

FEP

EmoQ-TTS

Script                      :   건물주의 오만하고 뻔뻔한 태도에 기가 찹니다.
Translation        :   I'm appalled by the arrogant and shameless attitude of the building owner.
Pronunciation :   Geonmuljuui omanhago ppeonppeonhan taedo-e giga chabnida.
Speaker               :   Female (nec)
Emotion               :   Angry

Ground Truth

Vocoded

TP-GST

FEP

EmoQ-TTS

Script                      :   점심 메뉴를 본인 마음대로 결정하는 상사를 볼 때마다 정말 못마땅해.
Translation        :   Whenever I see my boss deciding the lunch menu however he pleases, I really don't like it.
Pronunciation :   Jeomsim menyuleul bon-in ma-eumdaelo gyeoljeonghaneun sangsaleul bol ttaemada jeongmal mosmattanghae.
Speaker               :   Male (nek)
Emotion               :   Angry

Ground Truth

Vocoded

TP-GST

FEP

EmoQ-TTS

Ablation Study

Components of reference encoder

Script                      :   니가 뿌린 향수 참 달콤하고 좋은 향이 나.
Translation        :   The perfume you sprayed smells so sweet and nice.
Pronunciation :   Niga ppulin hyangsu cham dalkomhago joh-eun hyang-i na.
Emotion               :   Disgusted

w/o Both Classifier and Predictors

w/o Phoneme Classifier

w/o Auxiliary Predictors

EmoQ-TTS

Projection method

Script                      :   형이 말하는 걸 들어보면 참 근사해.
Translation        :   It's really cool to hear what you're saying.
Pronunciation :   Hyeong-i malhaneun geol deul-eobomyeon cham geunsahae.
Emotion               :   Happy

L1 distance w/o projection

PCA projection

EmoQ-TTS (LDA)

Intensity quantization

Script                      :   그러나 이제는 이런 단순대결구도에만 집착해서는 안됩니다.
Translation        :   But now we shouldn't stick to this simple confrontation structure.
Pronunciation :   Geuleona ijeneun ileon dansundaegyeolgudo-eman jibchaghaeseoneun andoebnida.
Emotion               :   Surprised

w/o intensity quantization

EmoQ-TTS

Vocoder Finetuning

All audio samples use FreGAN as the vocoder.

Script                      :   노란빛 파라솔 밑에서 그녀는 웃는다.
Translation        :   Under the yellow parasol, she laughs.
Pronunciation :   nolanbich palasol mit-eseo geunyeoneun usneunda.
Speaker               :   Single
Emotion               :   Happy

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)
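
The control conditions listed for each sample (weak, medium, strong, increasing, decreasing) can be read as per-phoneme intensity sequences supplied by hand instead of predicted. The following is a minimal sketch under stated assumptions, not the authors' code: the function name `intensity_schedule`, the mapping of weak/medium/strong to 0.0/0.5/1.0, and the linear ramps are hypothetical choices for illustration, using the 0 to 1 intensity range shown in the demo control.

```python
def intensity_schedule(pattern, num_phonemes):
    """Return one intensity value in [0, 1] per phoneme for a control pattern.

    Assumption (not from the source): constant patterns use fixed values
    and the increasing/decreasing patterns are linear ramps.
    """
    if num_phonemes < 1:
        raise ValueError("num_phonemes must be positive")
    constants = {"weak": 0.0, "medium": 0.5, "strong": 1.0}
    if pattern in constants:
        # Same intensity for every phoneme in the utterance.
        return [constants[pattern]] * num_phonemes
    if pattern in ("increasing", "decreasing"):
        steps = max(num_phonemes - 1, 1)
        ramp = [i / steps for i in range(num_phonemes)]
        return ramp if pattern == "increasing" else ramp[::-1]
    raise ValueError(f"unknown pattern: {pattern}")


if __name__ == "__main__":
    for name in ("weak", "medium", "strong", "increasing", "decreasing"):
        print(name, intensity_schedule(name, 5))
```

In a system with quantized intensity, each value would then be mapped to the nearest intensity label before conditioning the synthesizer; that mapping is left out here because the page does not describe it.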

Script                      :   공급체계의 합리화도 요긴한 과제다.
Translation        :   Rationalizing the supply system is also an essential task.
Pronunciation :   gong-geubchegyeui hablihwado yoginhan gwajeda.
Speaker               :   Single
Emotion               :   Sad

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)

Script                      :   북한이 이처럼 남한 적화가 가능하다는 희망을 갖고 있는 한 대화는 대남 혁명노선의 한 방편일 뿐입니다.
Translation        :   As long as North Korea holds the hope that communizing South Korea is possible, dialogue is merely one means of its revolutionary line toward the South.
Pronunciation :   bughan-i icheoleom namhan jeoghwaga ganeunghadaneun huimang-eul gajgo issneun han daehwaneun daenam hyeogmyeongnoseon-ui han bangpyeon-il ppun-ibnida.
Speaker               :   Single
Emotion               :   Angry

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)

Script                      :   금융실명제가 실시되면 기존 경제질서에 충격을 줄 뿐 아니라 지금의 해외유출 등 여러가지 부작용이 예상되는데요.
Translation        :   If the real-name financial system is implemented, it will not only impact the existing economic order, but also have various side effects such as the current overseas outflow.
Pronunciation :   geum-yungsilmyeongjega silsidoemyeon gijon gyeongjejilseoe chung-gyeog-eul jul ppun anila jigeum-ui haeoeyuchul deung yeoleogaji bujag-yong-i yesangdoeneundeyo.
Speaker               :   Single
Emotion               :   Surprised

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)

Script                      :   콘텍트 렌즈를 오래 착용했더니 눈이 빨갛게 충혈되었다.
Translation        :   After wearing contact lenses for a long time, my eyes were red.
Pronunciation :   kontegteu lenjeuleul olae chag-yonghaessdeoni nun-i ppalgahge chunghyeoldoeeossda.
Speaker               :   Single
Emotion               :   Fear

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)

Script                      :   유럽공동체는 오는 구십이년이면 통합을 완료 완전히 단일 시장화 될 예정인데 한국기업들은 어떻게 대처해야 할 것인지.
Translation        :   The European Community is scheduled to complete its integration by the coming year of ninety-two and become a fully single market; how should Korean companies respond?
Pronunciation :   yuleobgongdongcheneun oneun gusib-inyeon-imyeon tonghab-eul wanlyo wanjeonhi dan-il sijanghwa doel yejeong-inde hanguggieobdeul-eun eotteohge daecheohaeya hal geos-inji.
Speaker               :   Single
Emotion               :   Disgusted

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)

Script                      :   천구백사십사년 여름까지 영국에서 계속 훈련을 받았으며, 병력은 약 사천 명까지 늘어났다.
Translation        :   They continued to train in England until the summer of nineteen forty-four, and their strength grew to about four thousand troops.
Pronunciation :   cheongubaegsasibsanyeon yeoleumkkaji yeong-gug-eseo gyesog hunlyeon-eul bad-ass-eumyeo, byeonglyeog-eun yag sacheon myeongkkaji neul-eonassda.
Speaker               :   Female (ema)
Emotion               :   Happy

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)

Script                      :   너와 함께 하는 모든 시간이 즐거워!
Translation        :   All the time I spend with you is fun!
Pronunciation :   neowa hamkke haneun modeun sigan-i jeulgeowo!
Speaker               :   Male (nem)
Emotion               :   Happy

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)

Script                      :   사랑하지만 보내줘야 하는 그녀 앞에서 눈물이 나려는 걸 몇 번이고 참았어.
Translation        :   I held back my tears again and again in front of her, whom I love but have to let go.
Pronunciation :   salanghajiman bonaejwoya haneun geunyeo ap-eseo nunmul-i nalyeoneun geol myeoch beon-igo cham-ass-eo.
Speaker               :   Female (neb)
Emotion               :   Sad

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)

Script                      :   낙담하지 말라는 말을 들어도 어쩔 수가 없는걸.
Translation        :   I can't help it even if I'm told not to be discouraged.
Pronunciation :   nagdamhaji mallaneun mal-eul deul-eodo eojjeol suga eobsneungeol.
Speaker               :   Male (nen)
Emotion               :   Sad

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)

Script                      :   내 동생은 내가 다이어트만 시작하면 야식을 시켜 먹는다니까, 왠수가 따로 없어.
Translation        :   My younger sibling orders late-night snacks whenever I start a diet; they are my worst enemy.
Pronunciation :   nae dongsaeng-eun naega daieoteuman sijaghamyeon yasig-eul sikyeo meogneundanikka, waensuga ttalo eobs-eo.
Speaker               :   Female (ned)
Emotion               :   Angry

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)

Script                      :   주로 호수에서 물고기를 잡거나 수경 재배로 야채나 방울토마토를 생산합니다.
Translation        :   They mainly catch fish in lakes or grow vegetables and cherry tomatoes hydroponically.
Pronunciation :   julo hosueseo mulgogileul jabgeona sugyeong jaebaelo yachaena bang-ultomatoleul saengsanhabnida.
Speaker               :   Male (emf)
Emotion               :   Angry

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)

Other Language Test

The vocoder is still being trained (20k).
Samples from the well-trained vocoder will be added soon.

Script                      :   The football teams give a tea party.
Speaker               :   Female (0015)
Emotion               :   Happy

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)

Script                      :   All my gum tips gone as well.
Speaker               :   Male (0020)
Emotion               :   Angry

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)

Script                      :   I chose the right way.
Speaker               :   Female (0016)
Emotion               :   Sad

Predicted

Control (weak)

Control (medium)

Control (strong)

Control (increasing)

Control (decreasing)