DeepConversion: Voice conversion with limited parallel training data
Speech Communication (IF 2.4), Pub Date: 2020-06-04, DOI: 10.1016/j.specom.2020.05.004
Mingyang Zhang, Berrak Sisman, Li Zhao, Haizhou Li

A deep neural network approach to voice conversion usually depends on a large amount of parallel training data from the source and target speakers. In this paper, we propose a novel conversion pipeline, DeepConversion, that leverages a large amount of non-parallel, multi-speaker data but requires only a small amount of parallel training data. The idea is that the shared characteristics of speakers can be captured by training a speaker-independent general model on a large amount of publicly available, non-parallel, multi-speaker speech data. Such a general model can then be used to learn the mapping between the source and target speakers more effectively from a limited amount of parallel training data. We also propose a strategy to make full use of the parallel data in all models along the pipeline. In particular, the parallel data is used to adapt the general model towards the source-target speaker pair to achieve a coarse-grained conversion, and to develop a compact Error Reduction Network (ERN) for a fine-grained conversion. The parallel data is also used to adapt the WaveNet vocoder towards the source-target pair. The experiments show that DeepConversion, which uses only a limited amount of parallel training data, consistently outperforms traditional approaches that use a large amount of parallel training data, in both objective and subjective evaluations.
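The three stages of the pipeline described above (a speaker-independent general model, coarse-grained adaptation on a small parallel set, and a residual-correcting ERN) can be illustrated with a toy sketch. This is our own minimal analogy using linear maps fit by least squares, not the paper's actual neural architecture; all variable names and the interpolation weight `alpha` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # feature dimension (stand-in for spectral features)

# Unknown true source-to-target mapping for the speaker pair of interest.
W_true = np.eye(D) + 0.3 * rng.normal(size=(D, D))

# Stage 1: "general model" trained on plentiful multi-speaker data.
# Its target mapping only reflects characteristics shared across speakers.
X_multi = rng.normal(size=(2000, D))
W_shared = np.eye(D) + 0.3 * rng.normal(size=(D, D))
Y_multi = X_multi @ W_shared + 0.5 * rng.normal(size=(2000, D))
W_general = np.linalg.lstsq(X_multi, Y_multi, rcond=None)[0]

# Stage 2: coarse-grained conversion -- adapt the general model towards the
# source-target pair using a *small* parallel set (50 frames here).
X_par = rng.normal(size=(50, D))
Y_par = X_par @ W_true + 0.1 * rng.normal(size=(50, D))
W_pair = np.linalg.lstsq(X_par, Y_par, rcond=None)[0]
alpha = 0.7  # hypothetical interpolation weight between general and pair fit
W_adapted = (1 - alpha) * W_general + alpha * W_pair

# Stage 3: fine-grained conversion -- a residual corrector (our linear
# analogue of the Error Reduction Network) fit on the same parallel data.
residual = Y_par - X_par @ W_adapted
W_ern = np.linalg.lstsq(X_par, residual, rcond=None)[0]

def convert(x):
    """Coarse conversion followed by fine-grained residual correction."""
    return x @ W_adapted + x @ W_ern

# The adapted-plus-corrected pipeline should track the true mapping more
# closely than the general model alone on held-out frames.
X_test = rng.normal(size=(200, D))
err_general = np.mean((X_test @ W_general - X_test @ W_true) ** 2)
err_full = np.mean((convert(X_test) - X_test @ W_true) ** 2)
print(err_general, err_full)
```

The sketch mirrors the pipeline's division of labour: the general model supplies shared structure learned without parallel data, while the small parallel set is reused twice, once for adaptation and once for residual correction.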




Updated: 2020-06-04