On the localness modeling for the self-attention based end-to-end speech synthesis.
Neural Networks (IF 6.0), Pub Date: 2020-02-11, DOI: 10.1016/j.neunet.2020.01.034
Shan Yang 1, Heng Lu 2, Shiyin Kang 2, Liumeng Xue 1, Jinba Xiao 1, Dan Su 2, Lei Xie 1, Dong Yu 2

Attention based end-to-end speech synthesis achieves better performance in both prosody and quality than the conventional "front-end"-"back-end" structure. However, training such an end-to-end framework is usually time-consuming because of the use of recurrent neural networks. To enable parallel computation and long-range dependency modeling, a solely self-attention based framework named Transformer was recently proposed within the end-to-end family. However, it lacks position information in sequential modeling, so extra position representations are crucial for good performance. Moreover, self-attention computes latent representations as a weighted sum over the whole input sequence, which may disperse attention across the entire sequence rather than focusing on the more important neighboring input states, resulting in generation errors. In this paper, we introduce two localness modeling methods to enhance the self-attention based representation for speech synthesis; they retain self-attention's parallel computation and global-range dependency modeling while improving generation stability. We systematically analyze the solely self-attention based end-to-end speech synthesis framework and unveil the importance of local context. We then add the proposed relative-position-aware method to enhance local edges and experiment with different architectures to examine the effectiveness of localness modeling. To obtain a query-specific window and remove the hyper-parameter of the relative-position-aware approach, we further introduce a Gaussian-based bias to enhance localness. Experimental results indicate that both proposed localness-enhanced methods improve the performance of the self-attention model, especially when applied to the encoder. Moreover, the query-specific window of the Gaussian bias approach is more robust than the fixed relative edges.
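The abstract describes two ways of biasing self-attention toward nearby positions. Below is a minimal NumPy sketch of that idea; the function name, the scalar per-distance bias (a simplification of the vector embeddings used in relative-position-aware attention), and the fixed sigma (the paper's Gaussian window is query-specific) are illustrative assumptions, not the paper's implementation.

import numpy as np

def localness_self_attention(Q, K, V, rel_bias=None, max_rel_dist=4, sigma=None):
    """Scaled dot-product self-attention with two optional localness biases.

    Q, K, V: (T, d) arrays for a single attention head.
    rel_bias: learned scalar bias per clipped relative distance, length
              2*max_rel_dist + 1 (a scalar simplification of
              relative-position-aware attention, which adds vector
              embeddings to the keys).
    sigma:    width of a Gaussian penalty centred on each query position,
              standing in for a query-specific Gaussian window.
    """
    T, d = Q.shape
    logits = Q @ K.T / np.sqrt(d)                      # (T, T) attention energies

    pos = np.arange(T)
    rel = pos[None, :] - pos[:, None]                  # signed relative distances

    if rel_bias is not None:
        # Fixed-window relative edges: distances beyond the window are clipped.
        clipped = np.clip(rel, -max_rel_dist, max_rel_dist) + max_rel_dist
        logits = logits + rel_bias[clipped]

    if sigma is not None:
        # Gaussian localness bias: penalise positions far from the query.
        logits = logits - rel.astype(float) ** 2 / (2.0 * sigma ** 2)

    # Row-wise softmax, then weighted sum of the values.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example: 10 frames, 8-dim representations, Gaussian window of width 2.
T, d = 10, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
out = localness_self_attention(Q, K, V, sigma=2.0)     # shape (10, 8)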

Updated: 2020-02-12