Streaming automatic speech recognition with the transformer model
arXiv - CS - Sound Pub Date : 2020-01-08 , DOI: arxiv-2001.02674 Niko Moritz, Takaaki Hori, Jonathan Le Roux
Encoder-decoder based sequence-to-sequence models have demonstrated
state-of-the-art results in end-to-end automatic speech recognition (ASR).
Recently, the transformer architecture, which uses self-attention to model
temporal context information, has been shown to achieve significantly lower
word error rates (WERs) compared to recurrent neural network (RNN) based system
architectures. Despite this success, practical use has been limited to offline
ASR tasks, since encoder-decoder architectures typically require the entire
speech utterance as input. In this work, we propose a transformer-based
end-to-end ASR system for streaming ASR, where an output must be generated
shortly after each spoken word. To achieve this, we apply time-restricted
self-attention for the encoder and triggered attention for the encoder-decoder
attention mechanism. Our proposed streaming transformer architecture achieves
2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech, which
to our knowledge is the best published streaming end-to-end ASR result for this
task.
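The two mechanisms named in the abstract can be viewed as attention masks. A minimal NumPy sketch is below, assuming a single attention head with projection weights omitted; the function names, window sizes, and the `trigger_frames` input (e.g. from a CTC alignment, as in the triggered-attention idea) are illustrative, not the authors' implementation.

```python
import numpy as np

def time_restricted_mask(T, left, right):
    # Boolean (T, T) mask: frame t may attend to frames [t-left, t+right].
    idx = np.arange(T)
    rel = idx[None, :] - idx[:, None]  # rel[t, s] = s - t
    return (rel >= -left) & (rel <= right)

def time_restricted_self_attention(x, left=2, right=1):
    # x: (T, d) encoder frames. Restricting the look-ahead `right` bounds
    # the encoder latency, which is what enables streaming operation.
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = time_restricted_mask(T, left, right)
    scores = np.where(mask, scores, -np.inf)   # block out-of-window frames
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x, w

def triggered_attention_mask(num_tokens, trigger_frames, T, lookahead=0):
    # Boolean (num_tokens, T) mask for encoder-decoder attention:
    # output token i attends only to encoder frames
    # [0, trigger_frames[i] + lookahead], rather than the whole utterance.
    mask = np.zeros((num_tokens, T), dtype=bool)
    for i, tf in enumerate(trigger_frames):
        mask[i, : min(tf + lookahead + 1, T)] = True
    return mask
```

With both masks in place, neither the encoder nor the decoder needs to wait for the full utterance, so output can be emitted shortly after each spoken word.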
Updated: 2020-07-02