How Far Are We from Robust Voice Conversion: A Survey

arXiv - CS - Sound. Pub Date: 2020-11-24, DOI: arxiv-2011.12063
Tzu-hsien Huang, Jheng-hao Lin, Chien-yu Huang, Hung-yi Lee

Voice conversion (VC) technologies have improved greatly in recent years with the help of deep learning, but their ability to produce natural-sounding utterances under different conditions remains unclear. In this paper, we present a thorough study of the robustness of known VC models. We also modify these models, for example by replacing their speaker embeddings, to further improve their performance. We find that the sampling rate and audio duration greatly influence voice conversion. All the VC models degrade on unseen data, but AdaIN-VC is relatively more robust. In addition, jointly trained speaker embeddings are more suitable for voice conversion than those trained for speaker identification.

Updated: 2020-11-25
