EmoCat: Language-agnostic Emotional Voice Conversion
arXiv - CS - Sound. Pub Date: 2021-01-14, DOI: arxiv-2101.05695
Bastian Schnell, Goeric Huybrechts, Bartek Perz, Thomas Drugman, Jaime Lorenzo-Trueba

Emotional voice conversion models adapt the emotion in speech without changing the speaker identity or linguistic content. They are less data-hungry than text-to-speech models and make it possible to generate large amounts of emotional data for downstream tasks. In this work, we propose EmoCat, a language-agnostic emotional voice conversion model. It achieves high-quality emotion conversion in German with less than 45 minutes of German emotional recordings by exploiting large amounts of emotional data in US English. EmoCat is an encoder-decoder model based on CopyCat, a voice conversion system that transfers prosody. We use adversarial training to remove emotion leakage from the encoder to the decoder. The adversarial training is improved by a novel contribution to gradient reversal that truly reverses gradients. This makes it possible to remove only the leaking information and to converge to better optima with higher conversion performance. Evaluations show that EmoCat can convert to different emotions but falls short of the recordings in emotion intensity, especially for very expressive emotions. EmoCat achieves audio quality on par with the recordings for five out of six tested emotion intensities.
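
As an illustration of the gradient reversal idea mentioned in the abstract, the following minimal sketch shows the standard gradient reversal layer (identity in the forward pass, gradient multiplied by a negative factor in the backward pass) on a hypothetical scalar model. It is not EmoCat's modified reversal, whose details the abstract does not give. The function `adversarial_grads`, the weights `w` and `v`, and the scale `lam` are assumptions made for illustration only.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def adversarial_grads(w, v, x, y, lam=1.0):
    """Toy scalar model (hypothetical, not the EmoCat architecture).

    Encoder:    z = w * x
    Adversary:  p = sigmoid(v * z), trained to predict a binary label y
    A standard gradient reversal layer sits between z and the adversary.
    Returns (loss, grad_w, grad_v).
    """
    z = w * x                       # encoder output
    p = sigmoid(v * z)              # reversal layer is the identity going forward
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))

    dloss_dlogit = p - y            # derivative of cross-entropy w.r.t. the logit
    grad_v = dloss_dlogit * z       # adversary receives the ordinary gradient
    grad_z = dloss_dlogit * v       # gradient arriving at the reversal layer
    grad_w = (-lam * grad_z) * x    # sign is flipped before reaching the encoder
    return loss, grad_w, grad_v


if __name__ == "__main__":
    loss, grad_w, grad_v = adversarial_grads(w=0.5, v=2.0, x=1.0, y=1)
    # A descent step on v helps the adversary; the same descent step on w
    # (using the reversed gradient) makes the adversary's task harder.
    print(loss, grad_w, grad_v)
```

With plain gradient descent applied to both weights, the adversary learns to predict the label while the encoder is pushed to make that prediction harder, which is the mechanism adversarial training uses to discourage information leakage.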

Updated: 2021-01-15