Controllable neural text-to-speech synthesis using intuitive prosodic features


arXiv - CS - Sound. Pub Date: 2020-09-14, DOI: arxiv-2009.06775
Tuomo Raitio, Ramya Rasipuram, Dan Castellani

Modern neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the prosody of generated utterances often represents the average prosodic style of the database instead of having wide prosodic variation. Moreover, the generated prosody is solely defined by the input text, which does not allow for different styles for the same sentence. In this work, we train a sequence-to-sequence neural network conditioned on acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions. Experiments show that a model conditioned on sentence-wise pitch, pitch range, phone duration, energy, and spectral tilt can effectively control each prosodic dimension and generate a wide variety of speaking styles, while maintaining similar mean opinion score (4.23) to our Tacotron baseline (4.26).
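To make the idea of sentence-wise conditioning features concrete, the following is a minimal sketch of how frame-level and phone-level measurements might be collapsed into one value per prosodic dimension for a sentence. It is not the authors' implementation: the function names `sentence_prosody_features` and `normalize`, the use of log-F0, the choice of standard deviation as the pitch-range measure, and the scaling to [-1, 1] are all assumptions made for illustration. Spectral tilt is taken as an already-computed per-frame value, since the abstract does not say how it is measured.

```python
import math
import statistics


def sentence_prosody_features(f0_hz, energy, tilt, phone_durations_s):
    """Collapse frame- and phone-level values into one number per dimension.

    All inputs are plain lists of floats. Unvoiced frames are marked
    with f0 == 0 and are ignored for the pitch statistics.
    """
    # Keep voiced frames only and work in the log domain (an assumption).
    log_f0 = [math.log(f) for f in f0_hz if f > 0]
    if len(log_f0) < 2:
        raise ValueError("need at least two voiced frames")
    return {
        "pitch": statistics.fmean(log_f0),
        # Assumption: pitch range is measured as the spread of log-F0.
        "pitch_range": statistics.pstdev(log_f0),
        "phone_duration": statistics.fmean(phone_durations_s),
        "energy": statistics.fmean(energy),
        "spectral_tilt": statistics.fmean(tilt),
    }


def normalize(value, lo, hi):
    """Map value from [lo, hi] to [-1, 1], clipping anything outside."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = 2.0 * (value - lo) / (hi - lo) - 1.0
    return max(-1.0, min(1.0, scaled))


# Hypothetical toy sentence: four frames, three phones.
feats = sentence_prosody_features(
    f0_hz=[0.0, 100.0, 200.0, 0.0],
    energy=[0.2, 0.4, 0.6, 0.8],
    tilt=[-0.9, -0.8, -0.7, -0.6],
    phone_durations_s=[0.05, 0.10, 0.15],
)
```

With features of this kind, a conditioning vector could be fed to the model alongside the text at training time, and at synthesis time each normalized value could be set by hand to move the output along one prosodic dimension. The bounds passed to `normalize` would have to come from statistics of the training data, which this sketch does not compute.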


Updated: 2020-09-16
