On Addressing Practical Challenges for RNN-Transducer
arXiv - CS - Sound, Pub Date: 2021-04-27, DOI: arxiv-2105.00858
Rui Zhao, Jian Xue, Jinyu Li, Wenning Wei, Lei He, Yifan Gong

In this paper, we propose several methods to address practical challenges in deploying RNN Transducer (RNN-T) based speech recognition systems. The challenges are adapting a well-trained RNN-T model to a new domain without collecting audio data, and obtaining time stamps and confidence scores at the word level. The first challenge is solved with a data splicing method that concatenates speech segments extracted from source-domain data. To obtain time stamps, a phone prediction branch that shares the encoder is added to the RNN-T model for forced alignment. Finally, we obtain word-level confidence scores from several types of features calculated during decoding and from the confusion network. Evaluated on Microsoft production data, the splicing data adaptation method achieves relative word error rate reductions of 58.03% over the baseline and 15.25% over adaptation with the text-to-speech method. The proposed time stamping method achieves an average word timing difference of less than 50 ms while maintaining the recognition accuracy of the RNN-T model. We also obtain high confidence annotation performance with limited computation cost.
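
The splicing idea can be illustrated with a minimal sketch. The abstract does not say at what unit segments are extracted or how they are selected, so the word-level segment bank, the random choice among candidates, and the names `splice_utterance` and `segment_bank` below are all illustrative assumptions rather than the paper's method.

```python
import random
from typing import Dict, List, Optional

# One audio segment as raw samples (hypothetical representation).
Segment = List[float]


def splice_utterance(
    text: str,
    segment_bank: Dict[str, List[Segment]],
    rng: Optional[random.Random] = None,
) -> Optional[Segment]:
    """Build synthetic audio for `text` by concatenating source-domain segments.

    Assumes word-level segments; returns None if any word is missing from the bank.
    """
    if rng is None:
        rng = random.Random(0)
    spliced: Segment = []
    for word in text.lower().split():
        candidates = segment_bank.get(word)
        if not candidates:
            return None  # word never observed in the source-domain audio
        # Pick one of the available recordings of this word (assumed policy).
        spliced.extend(rng.choice(candidates))
    return spliced


# Toy bank: word -> segments cut from source-domain recordings (made-up samples).
bank = {
    "play": [[0.1, 0.2], [0.3]],
    "music": [[0.5, 0.6, 0.7]],
}
audio = splice_utterance("Play music", bank, random.Random(0))
```

The resulting audio, paired with the new-domain text, would then serve as adaptation data in place of recorded speech.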

Updated: 2021-05-04