End-to-End Bengali Speech Recognition, arXiv - CS - Sound


arXiv - CS - Sound. Pub Date: 2020-09-21, DOI: arxiv-2009.09615
Sayan Mandal, Sarthak Yadav and Atul Rai

Bengali is a prominent language of the Indian subcontinent. However, while many state-of-the-art acoustic models exist for prominent languages spoken in the region, research and resources for Bengali are few and far between. In this work, we apply CTC-based CNN-RNN networks, a prominent deep-learning-based end-to-end automatic speech recognition (ASR) technique, to the Bengali ASR task. We also propose and evaluate the applicability and efficacy of small 7x3 and 3x3 convolution kernels, which are widely used in the computer vision domain primarily because they are efficient in FLOPs and parameters. We propose two CNN blocks, the 2-layer Block A and the 4-layer Block B, in which the first layer uses a 7x3 kernel and the subsequent layers use only 3x3 kernels. Using the publicly available Large Bengali ASR Training data set, we benchmark and evaluate the performance of seven deep neural network configurations of varying complexity and depth on the Bengali ASR task. Our best model, with Block B, has a word error rate (WER) of 13.67, an absolute reduction of 1.39% over a comparable model with larger convolution kernels of size 41x11 and 21x11.
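To make the efficiency claim concrete, the following is a minimal sketch that counts kernel weights for the kernel sizes named in the abstract and derives the baseline WER implied by the reported figures. It rests on assumptions not stated in the abstract: channel counts are unknown, so weights are counted per (input channel, output channel) pair; the comparison model is assumed to use exactly two convolution layers (41x11, then 21x11); and biases, strides, and the RNN layers are ignored. The names below are illustrative only.

```python
# Kernel shapes taken from the abstract. Channel counts are NOT given there,
# so weights are counted per (input channel, output channel) pair (assumption).
LARGE_KERNELS = [(41, 11), (21, 11)]        # assumed two-layer comparison front end
BLOCK_A = [(7, 3), (3, 3)]                  # 2-layer block: 7x3, then 3x3
BLOCK_B = [(7, 3), (3, 3), (3, 3), (3, 3)]  # 4-layer block: 7x3, then three 3x3


def weights_per_channel_pair(kernels):
    """Sum kernel height * width over all layers (biases ignored)."""
    return sum(h * w for h, w in kernels)


large = weights_per_channel_pair(LARGE_KERNELS)
block_a = weights_per_channel_pair(BLOCK_A)
block_b = weights_per_channel_pair(BLOCK_B)

# WER figures from the abstract: best model 13.67, absolute reduction 1.39.
best_wer = 13.67
absolute_reduction = 1.39
baseline_wer = round(best_wer + absolute_reduction, 2)
relative_reduction = round(100 * absolute_reduction / baseline_wer, 2)

print(large, block_a, block_b)
print(baseline_wer, relative_reduction)
```

Under these assumptions the large kernels need 682 weights per channel pair, against 30 for Block A and 48 for Block B. The reported numbers imply a comparison-model WER of 15.06, so the 1.39-point absolute reduction corresponds to a relative reduction of roughly 9.2%. Actual parameter totals depend on the channel widths used in the paper, which the abstract does not give.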


Updated: 2020-11-12
