A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
arXiv - CS - Sound Pub Date : 2020-03-28 , DOI: arxiv-2003.12710 Tara N. Sainath, Yanzhang He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo-yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alex Gruenstein, Ke Hu, Minho Jin, Anjuli Kannan, Qiao Liang, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, Yuan Shangguan, Yash Sheth, Trevor Strohman, Mirko Visontai, Yonghui Wu, Yu Zhang, Ding Zhao
Thus far, end-to-end (E2E) models have not been shown to outperform
state-of-the-art conventional models with respect to both quality, i.e., word
error rate (WER), and latency, i.e., the time the hypothesis is finalized after
the user stops speaking. In this paper, we develop a first-pass Recurrent
Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell
(LAS) rescorer that surpasses a conventional model in both quality and latency.
On the quality side, we incorporate a large number of utterances across varied
domains to increase acoustic diversity and the vocabulary seen by the model. We
also train with accented English speech to make the model more robust to
different pronunciations. In addition, given the increased amount of training
data, we explore a varied learning rate schedule. On the latency front, we
explore using the end-of-sentence decision emitted by the RNN-T model to close
the microphone, and also introduce various optimizations to improve the speed
of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and
latency tradeoff compared to a conventional model. For example, for the same
latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more
than 400-times smaller in model size.
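The two-pass decoding described in the abstract (a streaming first-pass RNN-T producing hypotheses that a second-pass LAS model rescores) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the interpolation weight `lam`, the `las_scorer` callable, and the toy n-best scores are assumptions introduced here. The helper at the end shows how a relative WER improvement figure such as the quoted 8% is computed from two WER values.

```python
# Minimal sketch of two-pass rescoring: the streaming first pass (RNN-T)
# emits an n-best list of (hypothesis, log score); a second-pass scorer
# (standing in for the LAS rescorer) scores each hypothesis, and the two
# scores are combined with an interpolation weight. All numbers are toy
# values for illustration, not results from the paper.

def rescore(nbest, las_scorer, lam=0.5):
    """Pick the hypothesis with the best interpolated two-pass score.

    nbest      : list of (text, rnnt_log_score) from the first pass
    las_scorer : callable mapping text -> second-pass log score (placeholder)
    lam        : interpolation weight between the two passes (assumed)
    """
    best_text, best_score = None, float("-inf")
    for text, rnnt_score in nbest:
        combined = (1 - lam) * rnnt_score + lam * las_scorer(text)
        if combined > best_score:
            best_text, best_score = text, combined
    return best_text, best_score

def relative_wer_improvement(wer_baseline, wer_new):
    """Relative WER improvement: (baseline - new) / baseline."""
    return (wer_baseline - wer_new) / wer_baseline

if __name__ == "__main__":
    # Toy n-best list with made-up first-pass scores.
    nbest = [("turn on the lights", -3.2), ("turn on the light", -3.5)]
    # Stand-in for LAS scoring: a fixed lookup of second-pass scores.
    las = {"turn on the lights": -2.0, "turn on the light": -3.0}.get
    print(rescore(nbest, las, lam=0.5))
    # A drop from, say, 6.8% to 6.25% WER is roughly an 8% relative gain.
    print(round(relative_wer_improvement(6.8, 6.25), 3))
```

The key design point the abstract leans on is that the first pass is streaming (hypotheses and the end-of-sentence decision are available as the user speaks), so the second-pass rescoring cost is the main latency concern, which motivates the LAS speed optimizations mentioned above.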
Updated: 2020-05-05