M2-CTTS: END-TO-END MULTI-SCALE MULTI-MODAL CONVERSATIONAL TEXT-TO-SPEECH SYNTHESIS

Authors:

Teresa Matoso Manguangua Victor, Miguel João Manuel Filho, Alan Claude Award, Kunal, Rishi Matura, Hitesh

Page No: 27-31

Abstract:

Conversational text-to-speech (TTS) aims to synthesize a reply with prosody appropriate to the conversation history. However, modeling the conversation comprehensively remains a challenge: most conversational TTS systems extract only global information and omit local prosody features, which carry important fine-grained information such as keywords and emphasis. Moreover, textual features alone are insufficient, since acoustic features also carry a variety of prosody information. Hence, we propose M2-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system that aims to comprehensively utilize the historical conversation and enhance prosodic expression. More specifically, we design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling. Experimental results demonstrate that our model, which incorporates fine-grained context information and additionally considers acoustic features, achieves better prosody performance and naturalness in CMOS tests.

Description:

speech synthesis, conversational TTS, prosody, multi-grained, multi-modal

Volume & Issue

Volume 12, Issue 11
