Exploring attention mechanisms based on summary information for end-to-end automatic speech recognition
Neurocomputing (IF 6) Pub Date: 2021-09-13, DOI: 10.1016/j.neucom.2021.09.017
Jiabin Xue, Tieran Zheng, Jiqing Han

Recent studies have confirmed that attention mechanisms with a location constraint strategy help reduce the misrecognition caused by incorrect alignments in attention-based end-to-end automatic speech recognition (E2E ASR) systems. The significant advantage of these mechanisms is that they account for the monotonicity of the alignment by employing a location constraint vector. In most such mechanisms, this vector is obtained directly from historical attention scores. However, an unreasonable vector can introduce additional interference when an inaccurate historical attention score occurs, and this interference then propagates through all subsequent attention-scoring steps. To address this problem, we obtain a reasonable location constraint vector from the matching relationship between the historical output information and the summary information, where the summary information captures both the content and the temporal structure of the speech sequence. We further propose an enhanced location-constrained attention mechanism, the summary-constrained (SC) attention mechanism, which generates the vector with a matching relationship-based neural network. We use a summary subspace embedding learned by a linear subspace projection to represent the summary information. Furthermore, given the complementarity of the SC and typical location-constrained attention mechanisms, a fused attention mechanism combines the two to generate a more reasonable vector. E2E ASR systems based on the SC and fused attention mechanisms were evaluated on the Switchboard conversational telephone speech recognition task. The experimental results show that our mechanisms achieve relative word error rate reductions of 10.6% and 16.7%, respectively, compared with the baseline system.
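To make the baseline concrete, below is a minimal sketch of one step of a typical location-constrained (location-aware) attention mechanism of the kind the abstract describes, where the constraint features are derived by convolving the previous attention weights. All names, shapes, and parameters here are illustrative assumptions, not the authors' implementation; the paper's SC mechanism would instead derive the constraint from matching decoder history against a summary subspace embedding.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def location_constrained_attention(query, keys, prev_alpha,
                                   conv_kernel, W_q, W_k, W_f, v):
    """One decoding step of location-constrained additive attention (sketch).

    query      : decoder state, shape (d_q,)
    keys       : encoder outputs, shape (T, d_k)
    prev_alpha : previous attention weights, shape (T,)

    The location constraint feature f[t] is obtained by a 1-D convolution
    of prev_alpha -- the "historical attention score" source the paper
    identifies as error-prone: if prev_alpha is wrong, f is wrong, and the
    error feeds forward into every later alignment.
    """
    T = keys.shape[0]
    pad = len(conv_kernel) // 2
    padded = np.pad(prev_alpha, (pad, pad))
    # Convolve previous alignment to get one location feature per frame.
    f = np.array([padded[t:t + len(conv_kernel)] @ conv_kernel
                  for t in range(T)])
    # Additive (Bahdanau-style) energy with the location term added in.
    e = np.tanh(query @ W_q + keys @ W_k + np.outer(f, W_f)) @ v
    return softmax(e)
```

A quick usage check with random toy dimensions confirms the output is a valid alignment distribution over the T encoder frames (non-negative, summing to one).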




Updated: 2021-09-28