U2-VC: one-shot voice conversion using two-level nested U-structure, EURASIP Journal on Audio, Speech, and Music Processing

EURASIP Journal on Audio, Speech, and Music Processing (IF 2.4), Pub Date: 2021-11-24, DOI: 10.1186/s13636-021-00226-3
Fangkun Liu (1, 2), Hui Wang (1), Renhua Peng (2, 3), Chengshi Zheng (2, 3), Xiaodong Li (2, 3)

Voice conversion transforms the voice of a source speaker into that of a target speaker while keeping the linguistic content unchanged. One-shot voice conversion has recently become a hot topic because of its potentially wide range of applications: it can convert the voice of any source speaker to that of any target speaker, even when both speakers are unseen during training. Although great progress has been made in one-shot voice conversion, the naturalness of the converted speech remains a challenging problem. To further improve naturalness, this paper proposes U2-VC, a voice conversion algorithm based on a two-level nested U-structure (U2-Net). The U2-Net can extract both local and multi-scale features of the log-mel spectrogram, which helps to learn the time-frequency structures of the source and target speech. Moreover, we adopt sandwich adaptive instance normalization (SaAdaIN) in the decoder for speaker identity transformation, to retain more content information of the source speech while maintaining speaker similarity between the converted speech and the target speech. Experiments on the VCTK dataset show that U2-VC outperforms many state-of-the-art (SOTA) approaches, including AGAIN-VC and AdaIN-VC, in both objective and subjective measurements.
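The abstract does not spell out how SaAdaIN is formulated, so the following is only a minimal sketch of plain adaptive instance normalization (AdaIN), the operation the name builds on. It is not the paper's SaAdaIN and not the authors' implementation. The function names `channel_stats` and `adain`, the list-of-channels feature layout, and the toy values are all assumptions made for illustration. The idea shown is that each channel of the source (content) features is normalized over time and then rescaled with the per-channel mean and standard deviation of the target (speaker) features.

```python
import math


def channel_stats(feat, eps=1e-5):
    """Per-channel mean and standard deviation over the time axis.

    `feat` is a list of channels, each a list of frame values
    (a hypothetical layout chosen for this sketch).
    """
    means, stds = [], []
    for channel in feat:
        mu = sum(channel) / len(channel)
        var = sum((x - mu) ** 2 for x in channel) / len(channel)
        means.append(mu)
        stds.append(math.sqrt(var + eps))  # eps avoids division by zero
    return means, stds


def adain(content, style, eps=1e-5):
    """Plain AdaIN: normalize content per channel, then apply style statistics."""
    c_mean, c_std = channel_stats(content, eps)
    s_mean, s_std = channel_stats(style, eps)
    out = []
    for ch, cm, cs, sm, ss in zip(content, c_mean, c_std, s_mean, s_std):
        out.append([(x - cm) / cs * ss + sm for x in ch])
    return out


# Toy features with 2 channels: the source has 4 frames, the target has 3.
source = [[0.0, 1.0, 2.0, 3.0], [5.0, 5.5, 6.0, 6.5]]
target = [[10.0, 12.0, 14.0], [-1.0, 0.0, 1.0]]
converted = adain(source, target)
```

In this sketch the output keeps the source's frame count and temporal shape, while its per-channel mean and standard deviation follow the target. That is the sense in which AdaIN-style layers are used to swap speaker statistics while leaving content structure in place.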

Updated: 2021-11-24
