StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion
arXiv - CS - Sound, Pub Date: 2021-07-21, DOI: arxiv-2107.10394
Yinghao Aaron Li, Ali Zare, Nima Mesgarani

We present an unsupervised, non-parallel, many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2. Using a combination of adversarial source classifier loss and perceptual loss, our model significantly outperforms previous VC models. Although our model is trained on only 20 English speakers, it generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion. Using a style encoder, our framework can also convert plain reading speech into stylistic speech, such as emotional and falsetto speech. Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task revealed that our model produces natural-sounding voices, close to the sound quality of state-of-the-art text-to-speech (TTS) based voice conversion methods, without the need for text labels. Moreover, our model is fully convolutional and, when paired with a faster-than-real-time vocoder such as Parallel WaveGAN, can perform real-time voice conversion.

Updated: 2021-07-23