Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion, Digital Signal Processing


Digital Signal Processing (IF 2.9), Pub Date: 2021-05-26, DOI: 10.1016/j.dsp.2021.103110
Xiao Kang, Hao Huang, Ying Hu, Zhihua Huang

The vector quantized variational autoencoder (VQ-VAE) has recently become an increasingly popular method for non-parallel zero-shot voice conversion (VC). The reason is that VQ-VAE can disentangle the content and speaker representations of speech by using a content encoder and a speaker encoder, which suits the VC task: making the speech of a source speaker sound like that of a target speaker without changing the linguistic content. However, the converted speech is unsatisfactory, because the content encoder lacks linguistic supervision and it is therefore difficult to disentangle pure content representations from the acoustic features. To address this issue, within the VQ-VAE framework, a connectionist temporal classification (CTC) loss is proposed that guides the content encoder, through an auxiliary network, to learn pure content representations. Because the CTC loss is not affected by the sequence length of the content encoder output, adding linguistic supervision to the content encoder becomes much easier. The resulting non-parallel many-to-many voice conversion model is named CTC-VQ-VAE. VC experiments on the CMU ARCTIC and VCTK corpora are carried out to evaluate the proposed method. Both the objective and the subjective results show that the proposed approach significantly improves the speech quality and speaker similarity of the converted speech compared with the traditional VQ-VAE method.
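
The abstract's point that the CTC loss does not depend on the length of the content encoder output can be illustrated with the standard CTC forward recursion. The sketch below is not the authors' auxiliary network or training code. It is a minimal pure-Python version of the generic CTC negative log-likelihood, and the function name `ctc_neg_log_likelihood`, the toy two-symbol vocabulary, and the uniform frame probabilities are all assumptions made for illustration. It works with plain probabilities for readability; a practical implementation would normally work in log space to avoid underflow.

```python
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """Generic CTC loss via the forward recursion (illustrative sketch).

    probs:  list of T frames, each a list of per-symbol probabilities
            (hypothetical stand-in for an auxiliary network's output).
    target: list of label ids without blanks.
    Returns -log p(target | probs), summed over all valid alignments.
    """
    # Interleave blanks: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_states, num_frames = len(ext), len(probs)

    # alpha[s] = total probability of alignments ending in state s
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]                      # stay in the same state
            if s >= 1:
                total += alpha[s - 1]             # advance by one state
            # Skip over a blank only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end in the last label or the trailing blank
    likelihood = alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)
    return -math.log(likelihood)


# Toy vocabulary (assumption): index 0 is the blank, index 1 is one label.
uniform = [0.5, 0.5]
# The same one-label target is scored against 2 frames and against 5 frames.
loss_short = ctc_neg_log_likelihood([uniform] * 2, [1])
loss_long = ctc_neg_log_likelihood([uniform] * 5, [1])
```

Both calls score the same one-label target against frame sequences of different lengths, with no frame-level alignment supplied. This is the property that lets a sequence-level label supervise an encoder whose output length differs from the label length.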


Updated: 2021-06-02
