U2++: Unified Two-pass Bidirectional End-to-end Model for Speech Recognition
arXiv - CS - Sound. Pub Date: 2021-06-10. DOI: arxiv-2106.05642
Di Wu, Binbin Zhang, Chao Yang, Zhendong Peng, Wenjing Xia, Xiaoyu Chen, Xin Lei

The unified streaming and non-streaming two-pass (U2) end-to-end model for speech recognition has shown strong performance in terms of streaming capability, accuracy, real-time factor (RTF), and latency. In this paper, we present U2++, an enhanced version of U2 that further improves accuracy. The core idea of U2++ is to use the forward and backward information of the labeling sequences simultaneously during training to learn richer information, and to combine the forward and backward predictions at decoding time to give more accurate recognition results. We also propose a new data augmentation method called SpecSub to make the U2++ model more accurate and robust. Our experiments show that, compared with U2, U2++ converges faster during training, is more robust to the choice of decoding method, and achieves a consistent 5%-8% word error rate reduction over U2. On AISHELL-1, U2++ achieves a 4.63% character error rate (CER) with a non-streaming setup and 5.05% with a streaming setup at 320 ms latency. To the best of our knowledge, 5.05% is the best published streaming result on the AISHELL-1 test set.
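
The abstract only sketches the two core ideas. As a rough illustration, assuming a WeNet-style implementation, the second pass could rescore each first-pass CTC hypothesis with both a left-to-right and a right-to-left attention decoder, fused by a tunable reverse weight, while SpecSub could overwrite random time spans of the input features with earlier spans of the same utterance. The names and parameters below (combined_score, spec_sub, ctc_weight, reverse_weight, max_width, num_sub) are illustrative assumptions, not the paper's exact recipe.

    import torch

    def combined_score(ctc_score: float, att_l2r: float, att_r2l: float,
                       ctc_weight: float = 0.5, reverse_weight: float = 0.3) -> float:
        # Hypothetical second-pass fusion: blend the scores the forward (L2R)
        # and backward (R2L) attention decoders assign to a hypothesis, then
        # add the weighted first-pass CTC score.
        attention = (1.0 - reverse_weight) * att_l2r + reverse_weight * att_r2l
        return attention + ctc_weight * ctc_score

    def spec_sub(x: torch.Tensor, max_width: int = 30, num_sub: int = 3) -> torch.Tensor:
        # SpecSub-like sketch: copy a few random time spans from earlier frames
        # of the same (frames x feature_dim) spectrogram over later spans.
        y = x.clone()
        num_frames = x.size(0)
        for _ in range(num_sub):
            width = int(torch.randint(1, max_width + 1, (1,)))
            start = int(torch.randint(0, max(1, num_frames - width), (1,)))
            if start == 0:
                continue  # no earlier frames to copy from
            shift = int(torch.randint(1, start + 1, (1,)))  # how far back to reach
            y[start:start + width] = x[start - shift:start - shift + width]
        return y

    feats = torch.randn(200, 80)  # e.g., 200 frames of 80-dim log-Mel features
    augmented = spec_sub(feats)
    score = combined_score(ctc_score=-12.3, att_l2r=-10.1, att_r2l=-10.4)

Setting reverse_weight to 0 in this sketch would fall back to U2-style forward-only rescoring, which is consistent with the paper's framing of U2++ as an extension of U2.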

Updated: 2021-06-11