Streaming Multi-speaker ASR with RNN-T
arXiv - CS - Computation and Language. Pub Date: 2020-11-23, arXiv:2011.11671
Ilya Sklyar, Anna Piunova, Yulan Liu

Recent research shows that end-to-end ASR systems can recognize overlapped speech from multiple speakers. However, all published work has assumed no latency constraints during inference, an assumption that does not hold for most voice-assistant interactions. This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T), which has been shown to provide high recognition accuracy in a low-latency online recognition regime. We investigate two approaches to multi-speaker training of the RNN-T: deterministic output-target assignment and permutation invariant training. We show that, in the former case, guiding separation with speaker order labels enhances the high-level speaker-tracking capability of the RNN-T. In addition, multistyle training on single- and multi-speaker utterances makes the resulting models robust to an ambiguous number of speakers during inference. Our best model achieves a WER of 10.2% on simulated 2-speaker LibriSpeech data, competitive with the previously reported state-of-the-art non-streaming model (10.3%), while remaining directly applicable to streaming applications.
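The two training strategies the abstract contrasts differ in how model outputs are matched to reference transcripts: deterministic assignment fixes the output-to-speaker mapping in advance (e.g. via speaker order labels), while permutation invariant training (PIT) picks, for each utterance, the assignment with minimal total loss. A minimal sketch of the PIT assignment step is shown below; the `loss_matrix` values and function name are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of the PIT assignment step.
# loss_matrix[i][j] is assumed to be a precomputed loss of model
# output channel i scored against the transcript of speaker j.
from itertools import permutations

def pit_loss(loss_matrix):
    """Return the minimum total loss over all output-to-speaker
    assignments, together with the chosen permutation."""
    n = len(loss_matrix)
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # perm[i] is the target speaker assigned to output channel i.
        total = sum(loss_matrix[i][perm[i]] for i in range(n))
        if total < best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm

# 2-speaker example: output 0 fits speaker 1 better, and vice versa.
total, perm = pit_loss([[5.0, 1.0], [2.0, 6.0]])
# total = 3.0, achieved with assignment (1, 0)
```

Brute-force enumeration is fine for the 2-speaker setting considered here (2! = 2 permutations); for many speakers an optimal-assignment solver would replace the loop.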

Updated: 2020-11-25