Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

arXiv - CS - Sound, Pub Date: 2020-02-01, DOI: arxiv-2002.00198
Kun Zhou, Berrak Sisman, Haizhou Li

Emotional voice conversion aims to convert the spectrum and prosody of speech to change its emotional pattern while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the conversion of fundamental frequency (F0) with a simple linear transform. As F0 is a key aspect of intonation that is hierarchical in nature, we believe that it is more appropriate to model F0 at different temporal scales using the wavelet transform. We propose a CycleGAN network to find an optimal pseudo pair from non-parallel training data by learning forward and inverse mappings simultaneously using adversarial and cycle-consistency losses. We also study the use of the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales, which describe speech prosody at different time resolutions, for effective F0 conversion. Experimental results show that our proposed framework outperforms the baselines in both objective and subjective evaluations.
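
The CWT decomposition of F0 mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract states only that F0 is decomposed into ten temporal scales, so the Mexican hat mother wavelet, the octave-spaced scales, the log-F0 interpolation and normalization step, and the names `prepare_log_f0` and `cwt_mexican_hat` are all assumptions made for this example. The transform is computed with an FFT, which treats the contour as periodic.

```python
import numpy as np


def prepare_log_f0(f0):
    """Interpolate log-F0 over unvoiced frames (F0 == 0) and normalize it.

    Assumes at least one voiced frame and a non-constant contour.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    frames = np.arange(f0.size)
    # Linear interpolation of log-F0 across unvoiced gaps.
    log_f0 = np.interp(frames, frames[voiced], np.log(f0[voiced]))
    # Zero mean, unit variance.
    return (log_f0 - log_f0.mean()) / log_f0.std()


def cwt_mexican_hat(signal, num_scales=10, base_scale=1.0):
    """Decompose a 1-D contour into `num_scales` wavelet components.

    Scales are base_scale * 2**(i + 1) frames, one octave apart
    (an assumed spacing). Returns an array of shape (num_scales, len(signal)).
    """
    x = np.asarray(signal, dtype=float)
    n = x.size
    omega = 2.0 * np.pi * np.fft.fftfreq(n)  # radians per frame
    x_hat = np.fft.fft(x)
    # Constant of the Fourier transform of the unit-energy Mexican hat.
    norm = np.sqrt(8.0 / 3.0) * np.pi ** 0.25
    components = np.empty((num_scales, n))
    for i in range(num_scales):
        scale = base_scale * 2.0 ** (i + 1)
        sw = scale * omega
        psi_hat = norm * sw ** 2 * np.exp(-0.5 * sw ** 2)
        # Convolution with the scaled wavelet, done in the frequency domain.
        components[i] = np.real(np.fft.ifft(x_hat * np.sqrt(scale) * psi_hat))
    return components


# Synthetic F0 contour in Hz with an unvoiced gap (illustrative data only).
f0 = 120.0 + 20.0 * np.sin(np.linspace(0.0, 6.0 * np.pi, 400))
f0[50:80] = 0.0
components = cwt_mexican_hat(prepare_log_f0(f0))
print(components.shape)  # (10, 400)
```

Each row of `components` corresponds to one temporal scale, with row 0 holding the fastest variations and row 9 the slowest. Because the Mexican hat wavelet has zero mean, a constant contour yields an all-zero decomposition.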

Updated: 2020-10-27
