Segment boundary detection directed attention for online end-to-end speech recognition
EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7) Pub Date: 2020-01-30, DOI: 10.1186/s13636-020-0170-z
Junfeng Hou, Wu Guo, Yan Song, Li-Rong Dai

Attention-based encoder-decoder models have recently shown competitive performance for automatic speech recognition (ASR) compared to conventional ASR systems. However, how to employ attention models for online speech recognition still needs to be explored. Unlike conventional attention models, in which the soft alignment is obtained by a pass over the entire input sequence, attention models for online recognition must learn an online alignment that attends to part of the input sequence monotonically when generating output symbols. Based on the fact that every output symbol corresponds to a segment of the input sequence, we propose a new attention mechanism for learning online alignment by decomposing the conventional alignment into two parts: segmentation (segment boundary detection with a hard decision) and segment-directed attention (information aggregation within the segment with soft attention). Boundary detection proceeds along the time axis from left to right, and a decision is made for each input frame as to whether it is a segment boundary. When a boundary is detected, the decoder generates an output symbol by attending to the inputs within the corresponding segment. With the proposed attention mechanism, online speech recognition can be realized. Experimental results on the TIMIT and WSJ datasets show that the proposed attention mechanism achieves online performance comparable to that of state-of-the-art models.
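
The two-part decomposition described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `segment_directed_attention`, the fixed `threshold`, and the precomputed `boundary_probs` and `scores` arrays are all assumptions made for illustration. In a trained model these quantities would come from learned networks and would typically depend on the decoder state.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def segment_directed_attention(frames, boundary_probs, scores, threshold=0.5):
    """Scan frames left to right and emit one context vector per detected segment.

    frames:         (T, D) encoder outputs
    boundary_probs: (T,) hypothetical per-frame boundary probabilities
    scores:         (T,) hypothetical per-frame attention energies
    threshold:      hypothetical cutoff for the hard boundary decision
    """
    contexts = []
    start = 0  # first frame of the segment currently being accumulated
    for t in range(len(frames)):
        # Hard decision: is frame t a segment boundary?
        if boundary_probs[t] > threshold:
            # Soft attention restricted to the frames of this segment only.
            weights = softmax(scores[start:t + 1])
            contexts.append(weights @ frames[start:t + 1])
            start = t + 1  # the next segment begins after the boundary
    return contexts

frames = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
boundary_probs = np.array([0.1, 0.9, 0.2, 0.8])
scores = np.zeros(4)  # equal energies give uniform weights within a segment
contexts = segment_directed_attention(frames, boundary_probs, scores)
```

Because each context vector depends only on frames up to the detected boundary, a decoder consuming `contexts` could emit a symbol as soon as a segment closes, which is the property that makes the scheme usable online.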

Updated: 2020-01-30