Sequence-to-Sequence Acoustic Modeling for Voice Conversion
arXiv - CS - Sound. Pub Date: 2018-10-16, DOI: arxiv-1810.06865
Jing-Xuan Zhang, Zhen-Hua Ling, Li-Juan Liu, Yuan Jiang, Li-Rong Dai

In this paper, a neural network named Sequence-to-sequence ConvErsion NeTwork (SCENT) is presented for acoustic modeling in voice conversion. At the training stage, a SCENT model is estimated by implicitly aligning the feature sequences of the source and target speakers using an attention mechanism. At the conversion stage, the acoustic features and durations of source utterances are converted simultaneously by this unified acoustic model. Mel-scale spectrograms are adopted as acoustic features, since they contain both the excitation and vocal-tract descriptions of speech signals. Bottleneck features extracted from the source speech by an automatic speech recognition (ASR) model are appended as auxiliary input. A WaveNet vocoder conditioned on Mel-spectrograms is built to reconstruct waveforms from the outputs of the SCENT model. Notably, the proposed method achieves appropriate duration conversion, which is difficult for conventional methods. Experimental results show that the proposed method obtains better objective and subjective performance than baseline methods using Gaussian mixture models (GMM) and deep neural networks (DNN) as acoustic models. It also outperforms our previous work, which achieved the top rank in the Voice Conversion Challenge 2018. Ablation tests further confirm the effectiveness of several components of the proposed method.
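The paper itself is not reproduced here, but the architecture the abstract describes (an attention-based encoder-decoder over Mel-spectrograms, with ASR bottleneck features appended as auxiliary input and an autoregressive decoder that converts duration jointly with spectra) can be sketched roughly as follows. This is a minimal illustrative sketch in PyTorch under our own assumptions: the layer types, dimensions, and names (`Seq2SeqAcousticModel`, `n_mels`, `bottleneck_dim`, etc.) are hypothetical and do not reflect the authors' SCENT implementation.

```python
# Minimal sketch of a SCENT-style seq2seq acoustic model (illustrative only;
# all module choices and sizes are assumptions, not the paper's architecture).
import torch
import torch.nn as nn

class Seq2SeqAcousticModel(nn.Module):
    def __init__(self, n_mels=80, bottleneck_dim=256, hidden=512):
        super().__init__()
        # Encoder reads source mel frames concatenated with ASR bottleneck
        # features (the auxiliary input mentioned in the abstract).
        self.encoder = nn.LSTM(n_mels + bottleneck_dim, hidden,
                               batch_first=True, bidirectional=True)
        self.enc_proj = nn.Linear(2 * hidden, hidden)
        # Attention implicitly aligns source and target feature sequences.
        self.attention = nn.MultiheadAttention(hidden, num_heads=4,
                                               batch_first=True)
        # Autoregressive decoder predicts target mel frames one at a time;
        # a stop token lets the model decide the output length itself.
        self.decoder = nn.LSTMCell(n_mels + hidden, hidden)
        self.frame_out = nn.Linear(hidden, n_mels)
        self.stop_out = nn.Linear(hidden, 1)

    def forward(self, src_mels, src_bnf, tgt_mels):
        # src_mels: (B, T_src, n_mels); src_bnf: (B, T_src, bottleneck_dim)
        # tgt_mels: (B, T_tgt, n_mels), ground truth used for teacher forcing
        enc, _ = self.encoder(torch.cat([src_mels, src_bnf], dim=-1))
        memory = self.enc_proj(enc)                    # (B, T_src, hidden)
        B, T_tgt, n_mels = tgt_mels.shape
        h = memory.new_zeros(B, self.decoder.hidden_size)
        c = torch.zeros_like(h)
        prev = tgt_mels.new_zeros(B, n_mels)           # initial "go" frame
        frames, stops = [], []
        for t in range(T_tgt):
            # Query the encoder memory with the current decoder state.
            ctx, _ = self.attention(h.unsqueeze(1), memory, memory)
            h, c = self.decoder(torch.cat([prev, ctx.squeeze(1)], dim=-1),
                                (h, c))
            frames.append(self.frame_out(h))           # predicted mel frame
            stops.append(self.stop_out(h))             # stop-token logit
            prev = tgt_mels[:, t]                      # teacher forcing
        return torch.stack(frames, dim=1), torch.stack(stops, dim=1)
```

Because the decoder emits a stop token instead of copying the source frame count, the output length, and hence the duration, is learned jointly with the spectral mapping; this is the property the abstract highlights as difficult for conventional frame-aligned methods. At inference, the predicted mels would condition a WaveNet vocoder to reconstruct the waveform.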

Updated: 2020-01-14