Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models
arXiv - CS - Sound. Pub Date: 2021-07-20, DOI: arxiv-2107.09428
Tianzi Wang, Yuya Fujita, Xuankai Chang, Shinji Watanabe

Non-autoregressive (NAR) modeling has gained increasing attention in speech processing. With recent state-of-the-art attention-based automatic speech recognition (ASR) architectures, NAR models can achieve promising real-time factor (RTF) improvements with only a small degradation in accuracy compared to autoregressive (AR) models. However, inference must wait for the completion of a full speech utterance, which limits their application in low-latency scenarios. To address this issue, we propose a novel end-to-end streaming NAR speech recognition system that combines blockwise attention and connectionist temporal classification with mask-predict (Mask-CTC) NAR decoding. During inference, the input audio is separated into small blocks and processed in a blockwise streaming fashion. To address insertion and deletion errors at the edges of each block's output, we apply an overlapping decoding strategy with a dynamic mapping trick that produces more coherent sentences. Experimental results show that the proposed method improves online ASR recognition under low-latency conditions compared to vanilla Mask-CTC. Moreover, it achieves a much faster inference speed than AR attention-based models. All of our code will be publicly available at https://github.com/espnet/espnet.

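To make the blockwise streaming idea above concrete, below is a minimal, self-contained Python sketch (Python being the language of the linked ESPnet toolkit). It only illustrates the generic pattern of cutting the input into overlapping blocks, decoding each block as it arrives, and discarding each block's unreliable leading edge; the blockwise_stream and dummy_recognize helpers and the block/overlap sizes are hypothetical placeholders, and the simple edge-dropping merge stands in for, but is not, the paper's blockwise-attention encoder, Mask-CTC decoder, or dynamic mapping trick.

# A minimal, self-contained sketch of blockwise streaming decoding with
# overlapping blocks. The recognizer below is a dummy stand-in for the
# paper's blockwise-attention encoder + Mask-CTC decoder, and the merge
# rule simply drops each block's leading (overlapped) tokens rather than
# performing the paper's dynamic mapping.
from typing import Callable, List

import numpy as np


def blockwise_stream(frames: np.ndarray,
                     recognize: Callable[[np.ndarray], List[str]],
                     block_size: int = 40,
                     overlap: int = 16) -> List[str]:
    """Cut `frames` into overlapping blocks, decode each block as it
    arrives, and keep only the tokens not already covered by the previous
    block (the block edges are where insertion/deletion errors occur)."""
    hop = block_size - overlap
    output: List[str] = []
    start = 0
    while True:
        block = frames[start:start + block_size]   # frames available so far
        tokens = recognize(block)                  # one token per frame in this toy setup
        # The first `overlap` tokens were already emitted (further from a
        # block edge, hence more reliable) by the previous block.
        output.extend(tokens if start == 0 else tokens[overlap:])
        if start + block_size >= len(frames):      # last block reached the end
            return output
        start += hop


def dummy_recognize(block: np.ndarray) -> List[str]:
    # Placeholder recognizer: emits one pseudo-token per input frame.
    return [f"t{int(x)}" for x in block]


if __name__ == "__main__":
    utterance = np.arange(100, dtype=float)        # 100 dummy feature frames
    hyp = blockwise_stream(utterance, dummy_recognize)
    print(len(hyp), hyp[:5], hyp[-5:])             # 100 tokens, t0..t99 in order

In this toy setup the overlapped regions agree exactly, so dropping the leading tokens suffices; in real blockwise decoding the overlapping hypotheses can disagree, which is why the paper aligns them with a dynamic mapping instead of a fixed cut.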
Last updated: 2021-07-21