On the limit of English conversational speech recognition
arXiv - CS - Sound. Pub Date: 2021-05-03, DOI: arxiv-2105.00982
Zoltán Tüske, George Saon, Brian Kingsbury

In our previous work we demonstrated that a single-headed attention encoder-decoder model is able to reach state-of-the-art results in conversational speech recognition. In this paper, we further improve the results for both Switchboard 300 and 2000. Through the use of an improved optimizer, speaker vector embeddings, and alternative speech representations, we reduce the recognition errors of our LSTM system on Switchboard-300 by 4% relative. Compensating the decoder model with the probability-ratio approach allows more efficient integration of an external language model, and we report 5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM models. Our study also considers the recently proposed conformer, as well as more advanced self-attention-based language models. Overall, the conformer shows performance similar to the LSTM; nevertheless, their combination and decoding with an improved LM reach a new record on Switchboard-300: 5.0% and 10.0% WER on SWB and CHM. Our findings are also confirmed on Switchboard-2000, where a new state of the art is reported, practically reaching the limit of the benchmark.
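The probability-ratio compensation mentioned in the abstract corresponds to a density-ratio style of external-LM fusion, in which a source-domain LM estimate is subtracted from the shallow-fusion score so the external LM replaces, rather than double-counts, the linguistic prior implicit in the decoder. Below is a minimal sketch of this scoring rule; the function name, interpolation weights, and toy log-probabilities are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of probability-ratio (density-ratio) LM fusion for
# scoring one beam-search hypothesis. All names and numbers here are
# hypothetical; real scores would come from trained models.

def fused_score(log_p_asr: float,
                log_p_ext_lm: float,
                log_p_src_lm: float,
                lam_ext: float = 0.6,
                lam_src: float = 0.3) -> float:
    """Score a hypothesis y given input x.

    log_p_asr    : log P(y | x) from the attention encoder-decoder
    log_p_ext_lm : log P(y) from the external language model
    log_p_src_lm : log P(y) from a source-domain LM that approximates
                   the decoder's internal language model
    Subtracting the source-domain LM term compensates the decoder so
    the external LM contributes new linguistic evidence instead of
    reinforcing the prior already baked into the decoder.
    """
    return log_p_asr + lam_ext * log_p_ext_lm - lam_src * log_p_src_lm


if __name__ == "__main__":
    # Toy comparison: two hypotheses with identical acoustic scores;
    # the fused score prefers the one the external LM favors.
    print(fused_score(log_p_asr=-4.2, log_p_ext_lm=-9.1, log_p_src_lm=-11.5))
    print(fused_score(log_p_asr=-4.2, log_p_ext_lm=-12.7, log_p_src_lm=-11.5))
```

The weights lam_ext and lam_src would in practice be tuned on held-out data; setting lam_src to zero recovers plain shallow fusion.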

Updated: 2021-05-04