A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
arXiv - CS - Sound Pub Date : 2020-03-28 , DOI: arxiv-2003.12710 Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alex Gruenstein, Ke Hu, Minho Jin, Anjuli Kannan, Qiao Liang, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, Mirko Visontai, Yonghui Wu, Yu Zhang, Ding Zhao
Thus far, end-to-end (E2E) models have not been shown to outperform
state-of-the-art conventional models with respect to both quality, i.e., word
error rate (WER), and latency, i.e., the time the hypothesis is finalized after
the user stops speaking. In this paper, we develop a first-pass Recurrent
Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell
(LAS) rescorer that surpasses a conventional model in both quality and latency.
On the quality side, we incorporate a large number of utterances across varied
domains to increase acoustic diversity and the vocabulary seen by the model. We
also train with accented English speech to make the model more robust to
different pronunciations. In addition, given the increased amount of training
data, we explore a varied learning rate schedule. On the latency front, we
explore using the end-of-sentence decision emitted by the RNN-T model to close
the microphone, and also introduce various optimizations to improve the speed
of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and
latency tradeoff compared to a conventional model. For example, for the same
latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more
than 400-times smaller in model size.
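The two-pass decoding described in the abstract (a streaming first-pass RNN-T producing hypotheses that a second-pass LAS model rescores) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the interpolation weight `lam`, the `las_scorer` callable, and the toy n-best scores are assumptions introduced here. The helper at the end shows how a relative WER improvement figure such as the quoted 8% is computed from two WER values.

```python
# Minimal sketch of two-pass rescoring: the streaming first pass (RNN-T)
# emits an n-best list of (hypothesis, log score); a second-pass scorer
# (standing in for the LAS rescorer) scores each hypothesis, and the two
# scores are combined with an interpolation weight. All numbers are toy
# values for illustration, not results from the paper.

def rescore(nbest, las_scorer, lam=0.5):
    """Pick the hypothesis with the best interpolated two-pass score.

    nbest      : list of (text, rnnt_log_score) from the first pass
    las_scorer : callable mapping text -> second-pass log score (placeholder)
    lam        : interpolation weight between the two passes (assumed)
    """
    best_text, best_score = None, float("-inf")
    for text, rnnt_score in nbest:
        combined = (1 - lam) * rnnt_score + lam * las_scorer(text)
        if combined > best_score:
            best_text, best_score = text, combined
    return best_text, best_score

def relative_wer_improvement(wer_baseline, wer_new):
    """Relative WER improvement: (baseline - new) / baseline."""
    return (wer_baseline - wer_new) / wer_baseline

if __name__ == "__main__":
    # Toy n-best list with made-up first-pass scores.
    nbest = [("turn on the lights", -3.2), ("turn on the light", -3.5)]
    # Stand-in for LAS scoring: a fixed lookup of second-pass scores.
    las = {"turn on the lights": -2.0, "turn on the light": -3.0}.get
    print(rescore(nbest, las, lam=0.5))
    # A drop from, say, 6.8% to 6.25% WER is roughly an 8% relative gain.
    print(round(relative_wer_improvement(6.8, 6.25), 3))
```

The key design point the abstract leans on is that the first pass is streaming (hypotheses and the end-of-sentence decision are available as the user speaks), so the second-pass rescoring cost is the main latency concern, which motivates the LAS speed optimizations mentioned above.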
Updated: 2020-05-05