CarneliNet: Neural Mixture Model for Automatic Speech Recognition
arXiv - CS - Sound Pub Date : 2021-07-22 , DOI: arxiv-2107.10708
Aleksei Kalinov, Somshubra Majumdar, Jagadeesh Balam, Boris Ginsburg

End-to-end automatic speech recognition systems have achieved great accuracy by using deeper and deeper models. However, the increased depth comes with a larger receptive field that can negatively impact model performance in streaming scenarios. We propose an alternative approach that we call the Neural Mixture Model. The basic idea is to introduce a parallel mixture of shallow networks instead of a very deep network. To validate this idea we design CarneliNet -- a CTC-based neural network composed of three mega-blocks. Each mega-block consists of multiple parallel shallow sub-networks based on 1D depthwise-separable convolutions. We evaluate the model on the LibriSpeech, MLS and AISHELL-2 datasets and achieve results close to the state of the art for CTC-based models. Finally, we demonstrate that one can dynamically reconfigure the number of parallel sub-networks to accommodate the computational requirements without retraining.
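The parallel-mixture idea can be sketched in plain NumPy. Everything below is an illustrative assumption, not the paper's implementation: the class and function names are invented, the sub-network outputs are averaged (one simple way to combine a mixture), and the weights are random rather than trained. It only shows the structural point: each mega-block holds several shallow stacks of 1D depthwise-separable convolutions, and dropping sub-networks at inference changes compute but not output shape.

```python
import numpy as np

def depthwise_separable_conv1d(x, dw_kernel, pw_kernel):
    """One 1D depthwise-separable convolution: a per-channel (depthwise)
    convolution followed by a 1x1 (pointwise) channel mix."""
    c, t = x.shape
    k = dw_kernel.shape[1]          # assumes an odd kernel size for "same" padding
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # depthwise: each channel is convolved with its own kernel
    dw = np.stack([np.convolve(xp[i], dw_kernel[i], mode="valid")
                   for i in range(c)])
    return pw_kernel @ dw           # pointwise mix -> (channels, time)

class MegaBlock:
    """Hypothetical sketch of a mega-block: a mixture of shallow,
    parallel sub-networks instead of one deep stack."""
    def __init__(self, num_subnets, channels, kernel_size, depth, rng):
        # each sub-network is a short stack of depthwise-separable conv layers
        self.subnets = [
            [(0.1 * rng.standard_normal((channels, kernel_size)),
              0.1 * rng.standard_normal((channels, channels)))
             for _ in range(depth)]
            for _ in range(num_subnets)
        ]

    def forward(self, x, active=None):
        # Combine (here: average) the outputs of the active sub-networks.
        # Shrinking `active` reconfigures compute at inference time
        # without retraining, since each sub-network is independent.
        active = range(len(self.subnets)) if active is None else active
        outs = []
        for i in active:
            h = x
            for dw, pw in self.subnets[i]:
                h = np.maximum(depthwise_separable_conv1d(h, dw, pw), 0.0)  # ReLU
            outs.append(h)
        return np.mean(outs, axis=0)

rng = np.random.default_rng(0)
block = MegaBlock(num_subnets=4, channels=8, kernel_size=5, depth=2, rng=rng)
x = rng.standard_normal((8, 32))            # (channels, time) feature map
y_full = block.forward(x)                   # all 4 parallel sub-networks
y_lite = block.forward(x, active=[0, 1])    # only 2: cheaper, same output shape
print(y_full.shape, y_lite.shape)           # prints (8, 32) (8, 32)
```

In the real model three such mega-blocks are stacked and trained end-to-end with a CTC loss; the sketch above omits training entirely and exists only to make the width-over-depth structure concrete.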

Updated: 2021-07-23