VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis, arXiv - CS - Multimedia


arXiv - CS - Multimedia. Pub Date: 2021-07-07, DOI: arxiv-2107.03298
Hui Lu, Zhiyong Wu, Xixin Wu, Xu Li, Shiyin Kang, Xunying Liu, Helen Meng

This paper describes a variational auto-encoder based non-autoregressive text-to-speech (VAENAR-TTS) model. Autoregressive TTS (AR-TTS) models based on the sequence-to-sequence architecture can generate high-quality speech, but their sequential decoding process can be time-consuming. More recently, non-autoregressive TTS (NAR-TTS) models have been shown to be more efficient thanks to parallel decoding. However, these NAR-TTS models rely on phoneme-level durations to generate a hard alignment between the text and the spectrogram. Obtaining duration labels, either through forced alignment or knowledge distillation, is cumbersome. Furthermore, hard alignment based on phoneme expansion can degrade the naturalness of the synthesized speech. In contrast, the proposed VAENAR-TTS model is an end-to-end approach that does not require phoneme-level durations. It contains no recurrent structures and is completely non-autoregressive in both the training and inference phases. Based on the VAE architecture, the alignment information is encoded in the latent variable, and attention-based soft alignment between the text and the latent variable is used in the decoder to reconstruct the spectrogram. Experiments show that VAENAR-TTS achieves state-of-the-art synthesis quality, while its synthesis speed is comparable to that of other NAR-TTS models.
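The contrast the abstract draws between hard and soft alignment can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of scaled dot-product attention, and the choice to treat latent-variable frames as attention queries over the text encodings are all assumptions made for illustration.

```python
import math

def hard_align(text_states, durations):
    """Hard alignment: repeat each phoneme state by its integer duration."""
    frames = []
    for state, d in zip(text_states, durations):
        frames.extend([state] * d)
    return frames

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_align(queries, text_states):
    """Soft alignment (assumed form): each query frame attends over all
    text states with scaled dot-product attention; keys and values are
    both the text states. Returns (attention weights, aligned frames)."""
    dim = len(text_states[0])
    scale = math.sqrt(dim)
    weights, outputs = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in text_states]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * k[i] for wj, k in zip(w, text_states))
                        for i in range(dim)])
    return weights, outputs

# Hypothetical toy data: two phoneme encodings, three output frames.
text_states = [[1.0, 0.0], [0.0, 1.0]]
hard_frames = hard_align(text_states, durations=[2, 1])
latent_frames = [[4.0, 0.0], [2.0, 2.0], [0.0, 4.0]]  # stand-in for latent variable frames
attn, soft_frames = soft_align(latent_frames, text_states)
```

With hard alignment, every output frame is an exact copy of one phoneme state and the frame count is fixed by the duration labels. With soft alignment, each frame is a weighted mixture of phoneme states, so no duration labels are needed and transitions between phonemes can be gradual.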


Updated: 2021-07-08
