High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model
arXiv - CS - Sound. Pub Date: 2020-03-17, DOI: arxiv-2003.07482
Jinyu Li, Rui Zhao, Eric Sun, Jeremy H. M. Wong, Amit Das, Zhong Meng, and Yifan Gong

While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross-entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved. In this paper, we detail our recent efforts to improve conventional hybrid LSTM acoustic models for high-accuracy and low-latency automatic speech recognition. To achieve high accuracy, we use a contextual layer trajectory LSTM (cltLSTM), which decouples the temporal modeling and target classification tasks, and incorporates future context frames to get more information for accurate acoustic modeling. We further improve the training strategy with sequence-level teacher-student learning. To obtain low latency, we design a two-head cltLSTM, in which one head has zero latency and the other head has a small latency, compared to an LSTM. When trained with Microsoft's 65 thousand hours of anonymized training data and evaluated on test sets with 1.8 million words, the proposed two-head cltLSTM model with the proposed training strategy yields a 28.2% relative WER reduction over the conventional LSTM acoustic model, with a similar perceived latency.
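The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of how the two-head idea could be wired up, assuming stacked time-LSTMs for temporal modeling, a depth-LSTM that scans the per-frame layer outputs for classification, and a lookahead of 4 future frames for the contextual head. The class name, layer sizes, target count, and lookahead value are all illustrative assumptions, not the authors' implementation; see the paper for the actual cltLSTM definition.

```python
import torch
import torch.nn as nn


class TwoHeadCltLSTM(nn.Module):
    """Illustrative sketch: time-LSTMs do temporal modeling only; a
    depth-LSTM scans the per-frame layer outputs for classification.
    Head 0 uses no future frames (zero latency); head 1 concatenates
    `lookahead` future context frames (small fixed latency)."""

    def __init__(self, feat_dim=80, hidden=512, layers=4,
                 num_targets=9000, lookahead=4):
        super().__init__()
        self.lookahead = lookahead
        self.time_lstms = nn.ModuleList(
            [nn.LSTM(feat_dim if i == 0 else hidden, hidden, batch_first=True)
             for i in range(layers)]
        )
        self.depth_lstm_zero = nn.LSTM(hidden, hidden, batch_first=True)
        self.depth_lstm_ctx = nn.LSTM(hidden * (1 + lookahead), hidden,
                                      batch_first=True)
        self.head_zero = nn.Linear(hidden, num_targets)
        self.head_ctx = nn.Linear(hidden, num_targets)

    def forward(self, x):  # x: (batch, frames, feat_dim)
        outs = []
        h = x
        for lstm in self.time_lstms:  # temporal modeling
            h, _ = lstm(h)
            outs.append(h)
        g = torch.stack(outs, dim=2)  # (batch, frames, layers, hidden)
        b, t, l, d = g.shape

        # Zero-latency head: depth-LSTM over the layer axis, current frame only.
        z, _ = self.depth_lstm_zero(g.reshape(b * t, l, d))
        logits_zero = self.head_zero(z[:, -1].reshape(b, t, d))

        # Contextual head: append future frames (edge-padded at utterance end),
        # which adds a small fixed latency of `lookahead` frames.
        ctx = [g]
        for k in range(1, self.lookahead + 1):
            fut = torch.cat([g[:, k:], g[:, -1:].expand(b, k, l, d)], dim=1)
            ctx.append(fut)
        gc = torch.cat(ctx, dim=-1).reshape(b * t, l, -1)
        c, _ = self.depth_lstm_ctx(gc)
        logits_ctx = self.head_ctx(c[:, -1].reshape(b, t, d))
        return logits_zero, logits_ctx


if __name__ == "__main__":
    model = TwoHeadCltLSTM()
    feats = torch.randn(2, 50, 80)  # 2 utterances, 50 frames of features
    lz, lc = model(feats)
    print(lz.shape, lc.shape)       # both (2, 50, 9000)
```

This mirrors the abstract's description of the two heads: at inference time, the zero-latency head can emit posteriors as soon as a frame arrives, while the contextual head's output for a frame becomes available only after the lookahead frames have been observed.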

Updated: 2020-03-18