VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording
arXiv - CS - Sound. Pub Date: 2021-07-15, DOI: arXiv-2107.07509
Hirofumi Inaguma, Tatsuya Kawahara

In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches. We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states to tackle the vulnerability to long-form data. Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one. Moreover, the VAD-free inference can recognize long-form speech robustly for up to a few hours.
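The abstract's VAD-free inference idea — using CTC probabilities to decide when to reset the model's streaming states — can be illustrated with a minimal sketch. This is not the paper's actual algorithm; the function name, the blank-posterior threshold, and the minimum-run length are all illustrative assumptions, standing in for whatever criterion the authors derive from the CTC output.

```python
# Hypothetical sketch of the VAD-free reset idea: monitor the per-frame
# CTC blank posterior and trigger a state reset after a sustained run of
# high-blank (likely silence) frames. Threshold and window values are
# illustrative, not taken from the paper.

def find_reset_points(blank_probs, blank_threshold=0.99, min_blank_frames=40):
    """Return frame indices at which to reset streaming model states.

    blank_probs: per-frame CTC blank posteriors (floats in [0, 1]).
    A reset fires once `min_blank_frames` consecutive frames exceed
    `blank_threshold`; at most one reset per contiguous silence run.
    """
    reset_points = []
    run = 0  # length of the current high-blank run
    for t, p in enumerate(blank_probs):
        if p > blank_threshold:
            run += 1
            if run == min_blank_frames:
                reset_points.append(t)  # fire once, then let the run continue
        else:
            run = 0  # speech frame breaks the silence run
    return reset_points
```

On unsegmented long-form audio, resetting at such points bounds the encoder/decoder state growth, which is what lets inference stay robust over recordings lasting hours.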

Updated: 2021-07-16