A new joint CTC-attention-based speech recognition model with multi-level multi-head attention
EURASIP Journal on Audio, Speech, and Music Processing (IF 2.4) | Pub Date: 2019-10-28 | DOI: 10.1186/s13636-019-0161-0
Chu-Xiong Qin, Wen-Lin Zhang, Dan Qu

A method known as joint connectionist temporal classification (CTC)-attention-based speech recognition has recently attracted increasing attention and achieved impressive performance. This hybrid end-to-end architecture adds an auxiliary CTC loss to the attention-based model, which can impose additional constraints on the alignments. To better exploit end-to-end models, we propose improvements to both feature extraction and the attention mechanism. First, we introduce a joint model trained with high-level features derived from nonnegative matrix factorization (NMF). Second, we propose a hybrid attention mechanism that incorporates multi-head attention and computes attention scores over multi-level encoder outputs. Experiments on TIMIT show that our best model achieves state-of-the-art performance. Experiments on WSJ show that our method's word error rate (WER) is only 0.2% absolute worse than that of the best referenced method, which was trained on a much larger dataset, and that it outperforms all existing end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable in WER to the state-of-the-art end-to-end system.
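The abstract's two technical ingredients, the joint CTC-attention objective and multi-head attention computed over multi-level encoder outputs, can be sketched in code. The following is a minimal sketch assuming PyTorch; the class names, the interpolation weight lambda_weight, and the tensor layouts are illustrative assumptions, not details from the paper.

import torch
import torch.nn as nn

class JointCTCAttentionLoss(nn.Module):
    # Interpolates a CTC loss (which enforces monotonic alignments) with the
    # attention decoder's cross-entropy loss, as in hybrid CTC-attention ASR.
    # lambda_weight is a hypothetical tuning knob, not a value from the paper.
    def __init__(self, lambda_weight=0.3, blank=0, pad=-100):
        super().__init__()
        self.lambda_weight = lambda_weight
        self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss(ignore_index=pad)

    def forward(self, ctc_log_probs, dec_logits, targets,
                input_lens, target_lens, dec_targets):
        # ctc_log_probs: (T, N, C) log-probabilities from the encoder's CTC head
        # dec_logits:    (N, L, C) logits from the attention decoder
        # dec_targets:   (N, L) decoder token ids, padded with `pad`
        l_ctc = self.ctc(ctc_log_probs, targets, input_lens, target_lens)
        l_att = self.ce(dec_logits.reshape(-1, dec_logits.size(-1)),
                        dec_targets.reshape(-1))
        return self.lambda_weight * l_ctc + (1.0 - self.lambda_weight) * l_att

A multi-level multi-head attention can likewise be approximated by running a separate multi-head attention over each encoder level and mixing the resulting context vectors with learned weights; again a hedged sketch, not the authors' exact formulation:

class MultiLevelMultiHeadAttention(nn.Module):
    # Attends over several encoder levels with separate multi-head attentions
    # and combines the per-level contexts via a learned softmax weighting.
    def __init__(self, d_model, n_heads, n_levels):
        super().__init__()
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_levels))
        self.level_weights = nn.Parameter(torch.zeros(n_levels))

    def forward(self, query, level_outputs):
        # query: (N, L, d_model); level_outputs: list of (N, T, d_model)
        contexts = [attn(query, mem, mem)[0]
                    for attn, mem in zip(self.attns, level_outputs)]
        w = torch.softmax(self.level_weights, dim=0)
        return sum(w_i * c for w_i, c in zip(w, contexts))

Initializing level_weights to zeros makes the softmax start from a uniform mix of levels, so training decides how much each encoder depth contributes.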

Updated: 2019-10-28