Acoustic model-based subword tokenization and prosodic-context extraction without language knowledge for text-to-speech synthesis
Speech Communication (IF 3.2), Pub Date: 2020-09-24, DOI: 10.1016/j.specom.2020.09.003
Masashi Aso, Shinnosuke Takamichi, Norihiro Takamune, Hiroshi Saruwatari

This paper presents text tokenization and context extraction for text-to-speech (TTS) synthesis that require no language knowledge. To generate prosody, statistical parametric TTS synthesis typically requires expert knowledge of the target language, so the languages for which TTS synthesis is practical are limited to resource-rich ones. To achieve TTS synthesis without language knowledge, we propose acoustic model-based subword tokenization and unsupervised extraction of prosodic contexts. The subword tokenization determines language units suitable for prosody generation, and the context extraction retrieves contexts from pairs of subwords and prosody. Both methods work without language knowledge and can improve F0 prediction accuracy. Experimental evaluation demonstrates that 1) training of the proposed subword tokenization, which uses the expectation-maximization algorithm and deep neural networks, is empirically stable, 2) the proposed subword tokenization splits text into subwords that are close to language-specific units, and 3) the proposed methods outperform conventional methods using language model-based tokenization in terms of synthetic speech quality.
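
As a rough illustration of what "tokenizing text into subwords" under a scoring model can look like, the sketch below finds the highest-scoring split of a string by dynamic programming. It is a generic example, not the authors' method: the abstract does not describe the search procedure, and the function name `viterbi_segment`, the `scores` table, `max_len`, and `unk_score` are all hypothetical. In the proposed approach the scores would presumably come from the acoustic model rather than from a hand-written table.

```python
import math


def viterbi_segment(text, scores, max_len=4, unk_score=-10.0):
    """Return (pieces, total_score) for the best split of `text`.

    `scores` maps candidate subwords to log-scores (hypothetical values).
    A single character missing from `scores` falls back to `unk_score`,
    so every string has at least one valid segmentation.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best total score of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in scores:
                s = scores[piece]
            elif end - start == 1:
                s = unk_score
            else:
                continue  # unknown multi-character piece: not a candidate
            if best[start] + s > best[end]:
                best[end] = best[start] + s
                back[end] = start
    # Follow the back-pointers from the end of the string to recover pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]


# Toy score table, invented purely for illustration.
toy_scores = {"to": -1.0, "ken": -1.5, "token": -2.0, "s": -1.0}
pieces, total = viterbi_segment("tokens", toy_scores, max_len=5)
```

In an expectation-maximization style of training, a segmentation step like this would typically alternate with re-estimating the scores from the resulting segmentations; how the paper couples that loop with deep neural networks is not stated in the abstract.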




Updated: 2020-10-11