Whispered and Lombard Neural Speech Synthesis
arXiv - CS - Sound Pub Date : 2021-01-13 , DOI: arxiv-2101.05313
Qiong Hu, Tobias Bleisch, Petko Petkov, Tuomo Raitio, Erik Marchi, Varun Lakshminarasimhan

It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) Pre-training and fine-tuning a model for each style. 2) Lombard and whisper speech conversion through a signal-processing-based approach. 3) Multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that: 1) We can generate high-quality speech through the pre-training/fine-tuning approach for all speaking styles. 2) Although our speaker verification (SV) model is not explicitly trained to discriminate between speaking styles, and no Lombard or whisper speech is used for pre-training this system, the SV model can be used as a style encoder that generates style embeddings as input for the Tacotron system. We also show that the resulting synthetic Lombard speech yields a significant intelligibility gain.
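The abstract does not say how the SV-derived style embeddings are fed to Tacotron. As a rough illustration only, the sketch below assumes one common conditioning scheme: average the per-utterance SV embeddings of a style, L2-normalize the result, and append that vector to every encoder time step. The function names (`style_embedding`, `condition_encoder_outputs`) and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def style_embedding(utterance_embeddings: List[Vector]) -> Vector:
    """Assumed pooling step: mean of per-utterance SV embeddings, then L2-normalize."""
    if not utterance_embeddings:
        raise ValueError("need at least one utterance embedding")
    count = len(utterance_embeddings)
    dim = len(utterance_embeddings[0])
    mean = [sum(e[i] for e in utterance_embeddings) / count for i in range(dim)]
    return l2_normalize(mean)


def condition_encoder_outputs(encoder_outputs: List[Vector], style: Vector) -> List[Vector]:
    """Assumed conditioning step: append the same style vector to every time step."""
    return [frame + style for frame in encoder_outputs]


# Toy example: two 2-dimensional "SV embeddings" from utterances of one style.
lombard_style = style_embedding([[3.0, 0.0], [3.0, 8.0]])
# Two encoder time steps with 1-dimensional features each.
conditioned = condition_encoder_outputs([[1.0], [2.0]], lombard_style)
```

Under these assumptions, switching style at synthesis time would only require swapping the pooled embedding (for example, one computed from whisper utterances) while the rest of the model stays the same.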

Updated: 2021-01-15