StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion

arXiv - CS - Sound, Pub Date: 2021-07-21, DOI: arxiv-2107.10394
Yinghao Aaron Li, Ali Zare, Nima Mesgarani

We present an unsupervised, non-parallel, many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2. Using a combination of adversarial source classifier loss and perceptual loss, our model significantly outperforms previous VC models. Although our model is trained on only 20 English speakers, it generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion. Using a style encoder, our framework can also convert plain reading speech into stylistic speech, such as emotional and falsetto speech. Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task revealed that our model produces natural-sounding voices, close to the sound quality of state-of-the-art text-to-speech (TTS) based voice conversion methods, without the need for text labels. Moreover, our model is completely convolutional and, with a faster-than-real-time vocoder such as Parallel WaveGAN, can perform real-time voice conversion.

Updated: 2021-07-23
