Voice Conversion for Whispered Speech Synthesis
arXiv - CS - Sound | Pub Date: 2019-12-11 | DOI: arxiv-1912.05289
Marius Cotescu, Thomas Drugman, Goeric Huybrechts, Jaime Lorenzo-Trueba, Alexis Moinet

We present an approach to synthesizing whispered speech by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech. We investigate the use of Gaussian Mixture Models (GMM) and Deep Neural Networks (DNN) to model the mapping between the acoustic features of normal speech and those of whispered speech. We evaluate the naturalness and speaker similarity of the converted whisper on an internal corpus and on the publicly available wTIMIT corpus. We show that applying VC techniques is significantly better than using rule-based signal processing methods, and that it achieves results indistinguishable from copy-synthesis of natural whisper recordings. We also investigate the ability of the DNN model to generalize to unseen speakers when trained with data from multiple speakers, and show that excluding the target speaker from the training set has little or no impact on the perceived naturalness and speaker similarity of the converted whisper. The proposed DNN method is used in the newly released Whisper Mode of Amazon Alexa.
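To make the DNN-based mapping concrete, below is a minimal, hypothetical sketch of a feed-forward network that regresses whispered-speech acoustic features from frame-aligned normal-speech features. The feature dimension, network topology, loss, and training loop are illustrative assumptions; the abstract does not specify the exact feature set, alignment procedure, or architecture used in the paper.

```python
# Hypothetical sketch of a DNN acoustic-feature mapping for normal-to-whisper VC.
# Feature dimension, topology, and training details are assumptions for illustration.
import torch
import torch.nn as nn

FEAT_DIM = 40  # assumed per-frame acoustic feature dimension (e.g. mel-cepstra)

class WhisperVC(nn.Module):
    """Feed-forward regressor from normal-speech frames to whispered-speech frames."""
    def __init__(self, feat_dim: int = FEAT_DIM, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),  # predicted whispered-speech features
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def train_step(model, optimizer, normal_feats, whisper_feats):
    """One gradient step on a batch of time-aligned (normal, whispered) frame pairs."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(normal_feats), whisper_feats)
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with random stand-in data; real training would use time-aligned parallel
# recordings such as the internal corpus or wTIMIT mentioned in the abstract.
model = WhisperVC()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = train_step(model, opt, torch.randn(32, FEAT_DIM), torch.randn(32, FEAT_DIM))
```

In this sketch, the converted features would then be passed to a vocoder or copy-synthesis stage to produce the whispered waveform; the GMM variant mentioned in the abstract would replace the network with a joint-density GMM regression over the same paired frames.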

Updated: 2020-01-22