Statistical Voice Conversion with Quasi-Periodic WaveNet Vocoder
arXiv - CS - Sound Pub Date : 2019-07-21 , DOI: arxiv-1907.08940
Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda

In this paper, we investigate the effectiveness of a quasi-periodic WaveNet (QPNet) vocoder combined with a statistical spectral conversion technique for a voice conversion task. The WaveNet (WN) vocoder has been applied as the waveform generation module in many different voice conversion frameworks and achieves significant improvement over conventional vocoders. However, because of the fixed dilated convolution and generic network architecture, the WN vocoder lacks robustness against unseen input features and often requires a huge network size to achieve acceptable speech quality. Such limitations usually lead to performance degradation in the voice conversion task. To overcome this problem, the QPNet vocoder is applied, which includes a pitch-dependent dilated convolution component to enhance the pitch controllability and attain a more compact network than the WN vocoder. In the proposed method, input spectral features are first converted using a framewise deep neural network, and then the QPNet vocoder generates converted speech conditioned on the linearly converted prosodic and transformed spectral features. The experimental results confirm that the QPNet vocoder achieves significantly better performance than the same-size WN vocoder while maintaining comparable speech quality to the double-size WN vocoder.

Index Terms: WaveNet, vocoder, voice conversion, pitch-dependent dilated convolution, pitch controllability
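The two pitch-related ideas in the abstract, linear conversion of prosodic features and a dilation size that depends on pitch, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract gives no formulas, so everything below is an assumption: the mean-variance transform of log-F0 is a common choice for linear prosody conversion, and the function names, the `dense_factor` parameter, the default sampling rate, and the handling of unvoiced frames are all hypothetical.

```python
import math


def convert_f0(src_f0, src_mean, src_std, tgt_mean, tgt_std):
    """Linearly map source F0 (Hz) to the target speaker in the log-F0 domain.

    Assumption: a mean-variance transform of log-F0, with the means and
    standard deviations computed over voiced frames of each speaker.
    Unvoiced frames (F0 <= 0) are passed through as 0.
    """
    converted = []
    for f0 in src_f0:
        if f0 <= 0:
            converted.append(0.0)
            continue
        log_f0 = (math.log(f0) - src_mean) / src_std * tgt_std + tgt_mean
        converted.append(math.exp(log_f0))
    return converted


def pitch_dependent_dilation(f0, base_dilation, sample_rate=22050, dense_factor=4):
    """Scale a fixed dilation size by the local pitch period (hypothetical rule).

    Assumption: the scale is the number of samples per pitch period divided
    by `dense_factor`, so a lower F0 gives a larger dilation. Unvoiced
    frames fall back to the fixed base dilation.
    """
    if f0 <= 0:
        return base_dilation
    scale = sample_rate / (f0 * dense_factor)
    return max(1, round(base_dilation * scale))


if __name__ == "__main__":
    # Hypothetical speaker statistics: source centred at 100 Hz, target at 200 Hz.
    f0_track = [0.0, 100.0, 110.0]
    target_f0 = convert_f0(f0_track, math.log(100.0), 0.2, math.log(200.0), 0.2)
    for f0 in target_f0:
        print(round(f0, 1), pitch_dependent_dilation(f0, base_dilation=2))
```

Under these assumptions, converting the F0 track to a higher-pitched target shortens the dilation in voiced frames, while a fixed-dilation network would use the same spacing for every frame regardless of pitch.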

