Streaming automatic speech recognition with the transformer model
arXiv - CS - Sound. Pub Date: 2020-01-08, DOI: arxiv-2001.02674
Niko Moritz, Takaaki Hori, Jonathan Le Roux

Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recently, the transformer architecture, which uses self-attention to model temporal context information, has been shown to achieve significantly lower word error rates (WERs) than recurrent neural network (RNN) based system architectures. Despite this success, its practical use has been limited to offline ASR tasks, since encoder-decoder architectures typically require an entire speech utterance as input. In this work, we propose a transformer-based end-to-end system for streaming ASR, where an output must be generated shortly after each spoken word. To achieve this, we apply time-restricted self-attention in the encoder and triggered attention for the encoder-decoder attention mechanism. Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER on the "clean" and "other" test sets of LibriSpeech, which to our knowledge is the best published streaming end-to-end ASR result for this task.
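The key idea behind time-restricted self-attention is that each encoder frame may only attend to a bounded window of past and future frames, which caps the look-ahead latency. A minimal NumPy sketch of such a windowed attention mask is shown below; the window sizes, the single attention head, and the omission of learned query/key/value projections are simplifying assumptions for illustration, not the paper's actual configuration:

```python
import numpy as np

def time_restricted_mask(seq_len, left_context, right_context):
    """Boolean mask: mask[i, j] is True iff query frame i may attend to frame j."""
    idx = np.arange(seq_len)
    diff = idx[None, :] - idx[:, None]  # j - i for every (i, j) pair
    # Frame i sees only frames j with i - left_context <= j <= i + right_context.
    return (diff >= -left_context) & (diff <= right_context)

def time_restricted_self_attention(x, left_context=8, right_context=2):
    """Single-head dot-product self-attention restricted to a local time window."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)                    # raw attention scores
    mask = time_restricted_mask(seq_len, left_context, right_context)
    scores = np.where(mask, scores, -np.inf)         # block out-of-window frames
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # masked softmax over each row
    return weights @ x
```

With `right_context` frames of look-ahead per layer, the total algorithmic latency grows linearly with the number of stacked encoder layers, which is why streaming encoders keep the right context small.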

Updated: 2020-07-02