CarneliNet: Neural Mixture Model for Automatic Speech Recognition
arXiv - CS - Sound Pub Date : 2021-07-22 , DOI: arxiv-2107.10708
Aleksei Kalinov, Somshubra Majumdar, Jagadeesh Balam, Boris Ginsburg

End-to-end automatic speech recognition systems have achieved great accuracy by using deeper and deeper models. However, the increased depth comes with a larger receptive field that can negatively impact model performance in streaming scenarios. We propose an alternative approach that we call the Neural Mixture Model. The basic idea is to introduce a parallel mixture of shallow networks instead of a very deep network. To validate this idea we design CarneliNet -- a CTC-based neural network composed of three mega-blocks. Each mega-block consists of multiple parallel shallow sub-networks based on 1D depthwise-separable convolutions. We evaluate the model on the LibriSpeech, MLS and AISHELL-2 datasets and achieve results close to the state of the art for CTC-based models. Finally, we demonstrate that one can dynamically reconfigure the number of parallel sub-networks to accommodate the computational requirements without retraining.
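The parallel-mixture idea can be sketched in plain NumPy. Everything below is an illustrative assumption, not the paper's implementation: the class and function names are invented, the sub-network outputs are averaged (one simple way to combine a mixture), and the weights are random rather than trained. It only shows the structural point: each mega-block holds several shallow stacks of 1D depthwise-separable convolutions, and dropping sub-networks at inference changes compute but not output shape.

```python
import numpy as np

def depthwise_separable_conv1d(x, dw_kernel, pw_kernel):
    """One 1D depthwise-separable convolution: a per-channel (depthwise)
    convolution followed by a 1x1 (pointwise) channel mix."""
    c, t = x.shape
    k = dw_kernel.shape[1]          # assumes an odd kernel size for "same" padding
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # depthwise: each channel is convolved with its own kernel
    dw = np.stack([np.convolve(xp[i], dw_kernel[i], mode="valid")
                   for i in range(c)])
    return pw_kernel @ dw           # pointwise mix -> (channels, time)

class MegaBlock:
    """Hypothetical sketch of a mega-block: a mixture of shallow,
    parallel sub-networks instead of one deep stack."""
    def __init__(self, num_subnets, channels, kernel_size, depth, rng):
        # each sub-network is a short stack of depthwise-separable conv layers
        self.subnets = [
            [(0.1 * rng.standard_normal((channels, kernel_size)),
              0.1 * rng.standard_normal((channels, channels)))
             for _ in range(depth)]
            for _ in range(num_subnets)
        ]

    def forward(self, x, active=None):
        # Combine (here: average) the outputs of the active sub-networks.
        # Shrinking `active` reconfigures compute at inference time
        # without retraining, since each sub-network is independent.
        active = range(len(self.subnets)) if active is None else active
        outs = []
        for i in active:
            h = x
            for dw, pw in self.subnets[i]:
                h = np.maximum(depthwise_separable_conv1d(h, dw, pw), 0.0)  # ReLU
            outs.append(h)
        return np.mean(outs, axis=0)

rng = np.random.default_rng(0)
block = MegaBlock(num_subnets=4, channels=8, kernel_size=5, depth=2, rng=rng)
x = rng.standard_normal((8, 32))            # (channels, time) feature map
y_full = block.forward(x)                   # all 4 parallel sub-networks
y_lite = block.forward(x, active=[0, 1])    # only 2: cheaper, same output shape
print(y_full.shape, y_lite.shape)           # prints (8, 32) (8, 32)
```

In the real model three such mega-blocks are stacked and trained end-to-end with a CTC loss; the sketch above omits training entirely and exists only to make the width-over-depth structure concrete.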

Updated: 2021-07-23