Attention based on-device streaming speech recognition with large speech corpus
arXiv - CS - Sound, Pub Date: 2020-01-02, DOI: arxiv-2001.00577
Kwangyoun Kim, Kyungmin Lee, Dhananjaya Gowda, Junmo Park, Sungsoo Kim, Sichen Jin, Young-Yoon Lee, Jinsu Yeo, Daehyun Kim, Seokyeong Jung, Jungin Lee, Myoungji Han, Chanwoo Kim

In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (>10K hours) corpus. We attained around a 90% word recognition rate for the general domain, mainly by using joint training with connectionist temporal classification (CTC) and cross entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pre-training, and data augmentation methods. In addition, we compressed our models to more than 3.4 times smaller using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization, bringing the final model size below 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models and achieved a relative 36% improvement on average in word error rate (WER) for target domains, including the general domain.
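To make the two compression steps mentioned above concrete, the following minimal sketch shows the general idea of low-rank approximation of a weight matrix followed by 8-bit quantization. It is not the paper's iterative hyper-LRA procedure; the matrix size, rank, and quantization scheme are illustrative assumptions only.

```python
# Sketch of LRA + int8 compression on a single weight matrix (illustrative
# values, not from the paper).
import numpy as np

def low_rank_approx(W: np.ndarray, rank: int):
    """Factor W (m x n) into A (m x rank) @ B (rank x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

def quantize_int8(X: np.ndarray):
    """Symmetric per-tensor 8-bit quantization: X ~= scale * q."""
    scale = np.abs(X).max() / 127.0
    q = np.clip(np.round(X / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy example: a 1024 x 1024 float32 layer compressed with rank-128 LRA + int8.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)
A, B = low_rank_approx(W, rank=128)
qA, sA = quantize_int8(A)
qB, sB = quantize_int8(B)

orig_bytes = W.nbytes                      # 4 MB of float32 weights
compressed_bytes = qA.nbytes + qB.nbytes   # int8 low-rank factors
print(f"compression ratio: {orig_bytes / compressed_bytes:.1f}x")
```

In a real system, the rank would be chosen (and the factorization re-applied iteratively) to trade model size against recognition accuracy, which is the balance the abstract refers to.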

Updated: 2020-01-06