High-Quality Many-to-Many Voice Conversion Using Transitive Star Generative Adversarial Networks with Adaptive Instance Normalization
Journal of Circuits, Systems and Computers ( IF 0.9 ) Pub Date : 2021-02-19 , DOI: 10.1142/s0218126621501887
Yanping Li 1 , Zhengtao He 1 , Yan Zhang 2 , Zhen Yang 1

This paper proposes a novel high-quality nonparallel many-to-many voice conversion (VC) method based on transitive star generative adversarial networks with adaptive instance normalization (Trans-StarGAN-VC with AdaIN). First, we improve the structure of the generator with TransNets to make full use of the hierarchical features associated with speech naturalness. In TransNets, many shortcut connections share hierarchical features between the encoding and decoding parts to capture sufficient linguistic and semantic information, which helps to produce natural-sounding converted speech and accelerates the convergence of the training process. Second, by incorporating AdaIN for style transfer, we enable the generator to learn sufficient speaker characteristic information directly from speech instead of from attribute labels, which also provides a promising framework for one-shot VC. Objective and subjective experiments with nonparallel training data show that our method significantly outperforms StarGAN-VC in both speech naturalness and speaker similarity. The mean values of the mean opinion score (MOS) and ABX increase by 24.5% and 10.7%, respectively. A comparison of spectrograms also shows that our method provides more complete harmonic structures and details, and effectively bridges the gap between converted speech and target speech.
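
For readers unfamiliar with AdaIN, the following is a minimal sketch of the generic operation, not the paper's implementation. It normalizes each channel of a feature map over time and then applies a per-channel scale and shift. The abstract does not say how the scale and shift are obtained; here the names `gamma` and `beta` are hypothetical and are assumed to be produced from a reference utterance of the target speaker by some style encoder. The function name `adain` and the toy feature values are likewise illustrative only.

```python
import math


def adain(content, gamma, beta, eps=1e-5):
    """Generic adaptive instance normalization on a (channels x frames) map.

    content: list of channels, each a list of floats (one value per frame).
    gamma, beta: per-channel scale and shift. ASSUMPTION: these would come
    from a speaker-style encoder; the abstract does not specify this.
    """
    out = []
    for channel, g, b in zip(content, gamma, beta):
        # Per-channel statistics computed over the time axis only.
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        # Remove the source statistics, then impose the target ones.
        out.append([g * (v - mean) / std + b for v in channel])
    return out


# Toy feature map: 2 channels, 4 frames (made-up values).
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]]
styled = adain(features, gamma=[2.0, 0.5], beta=[1.0, -1.0])
```

After the call, each output channel has a mean equal to its `beta` and a standard deviation approximately equal to its `gamma`, which is how the speaker style is injected without an explicit attribute label.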

Updated: 2021-02-19