MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames


arXiv - CS - Sound. Pub Date: 2021-02-25, DOI: arxiv-2102.12841
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However, owing to their insufficient ability to grasp time-frequency structures, their application is limited to mel-cepstrum conversion and not mel-spectrogram conversion despite recent advances in mel-spectrogram vocoders. To overcome this, CycleGAN-VC3, an improved variant of CycleGAN-VC2 that incorporates an additional module called time-frequency adaptive normalization (TFAN), has been proposed. However, this comes at the cost of an increase in the number of learned parameters. As an alternative, we propose MaskCycleGAN-VC, which is another extension of CycleGAN-VC2 and is trained using a novel auxiliary task called filling in frames (FIF). With FIF, we apply a temporal mask to the input mel-spectrogram and encourage the converter to fill in missing frames based on surrounding frames. This task allows the converter to learn time-frequency structures in a self-supervised manner and eliminates the need for an additional module such as TFAN. A subjective evaluation of the naturalness and speaker similarity showed that MaskCycleGAN-VC outperformed both CycleGAN-VC2 and CycleGAN-VC3 with a model size similar to that of CycleGAN-VC2. Audio samples are available at http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/maskcyclegan-vc/index.html.
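
The temporal masking step that FIF relies on can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `apply_temporal_mask`, the `max_mask_frames` parameter, the uniform sampling of mask length and position, and passing the mask to the converter as an extra input channel are all assumptions made for illustration. The abstract only states that a temporal mask is applied to the input mel-spectrogram and that the converter is encouraged to fill in the missing frames.

```python
import numpy as np


def apply_temporal_mask(mel, max_mask_frames=25, rng=None):
    """Zero out one contiguous run of frames in a (n_mels, n_frames) array.

    Returns the masked spectrogram and the binary mask (1 = kept, 0 = missing).
    The sampling policy here is hypothetical, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_mels, n_frames = mel.shape

    # Assumed policy: draw the mask length and start frame uniformly at random.
    mask_len = int(rng.integers(0, min(max_mask_frames, n_frames) + 1))
    start = int(rng.integers(0, n_frames - mask_len + 1))

    # The mask covers every mel bin of the selected frames (a purely temporal mask).
    mask = np.ones_like(mel)
    mask[:, start:start + mask_len] = 0.0
    return mel * mask, mask


rng = np.random.default_rng(0)
# Random values stand in for a real 80-bin, 64-frame mel-spectrogram.
mel = rng.standard_normal((80, 64)).astype(np.float32)

masked, mask = apply_temporal_mask(mel, max_mask_frames=25, rng=rng)

# Assumption: the converter sees both the masked input and the mask itself.
converter_input = np.stack([masked, mask], axis=0)
print(converter_input.shape, int((mask[0] == 0).sum()), "frames masked")
```

During training, a converter fed `converter_input` would have to reconstruct plausible content for the zeroed frames from their neighbours, which is the self-supervised signal the abstract describes.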


Updated: 2021-02-26
