Time-Frequency Localization Using Deep Convolutional Maxout Neural Network in Persian Speech Recognition
Neural Processing Letters (IF 3.1), Pub Date: 2022-08-27, DOI: 10.1007/s11063-022-11006-1
Arash Dehghani , Seyyed Ali Seyyedsalehi

This paper proposes a CNN-based structure for the time-frequency localization of information in Persian speech recognition. Research has shown that the spectrotemporal plasticity of the receptive fields of some neurons in the mammalian primary auditory cortex and midbrain provides localization capabilities that improve recognition performance. Over the past few years, much work has been done to localize time-frequency information in ASR systems using the spatial or temporal invariance properties of methods such as HMMs, TDNNs, CNNs, and LSTM-RNNs. However, most of these models have a large number of parameters and are challenging to train. To address this, we present a structure called the Time-Frequency Convolutional Maxout Neural Network (TFCMNN), in which parallel time-domain and frequency-domain 1D-CMNNs are applied simultaneously and independently to the spectrogram; their outputs are then concatenated and fed jointly to a fully connected Maxout network for classification. To improve the performance of this structure, we use recently developed techniques such as Dropout, maxout, and weight normalization. Two sets of experiments were designed and run on the FARSDAT dataset to compare this model with conventional 1D-CMNN models. According to the experimental results, the average recognition score of the TFCMNN models is about 1.6% higher than that of the conventional 1D-CMNN models, and their average training time is about 17 hours shorter. These results are consistent with findings reported in other sources: time-frequency localization in ASR systems increases accuracy and speeds up training.
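The data flow described in the abstract (two parallel 1D convolutional maxout branches over the spectrogram, concatenation, then a fully connected maxout classifier) can be illustrated with a small NumPy forward pass. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not give layer counts, kernel widths, or feature sizes, so every dimension below is hypothetical, as are the function and parameter names. It also assumes that the time-domain branch treats frequency bins as input channels and vice versa, and it omits Dropout and weight normalization.

```python
import numpy as np

def maxout(z, k):
    """Maxout: split the last axis into groups of k units and keep each group's max."""
    *lead, n = z.shape
    assert n % k == 0, "number of units must be divisible by the group size"
    return z.reshape(*lead, n // k, k).max(axis=-1)

def conv1d_maxout(x, w, b, k):
    """Valid 1D convolution along axis 0 of x, followed by maxout.

    x: (length, channels_in), w: (width, channels_in, channels_out), b: (channels_out,)
    Returns an array of shape (length - width + 1, channels_out // k).
    """
    width = w.shape[0]
    steps = x.shape[0] - width + 1
    # Sliding windows over axis 0: (steps, width, channels_in)
    windows = np.stack([x[i:i + width] for i in range(steps)])
    z = np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b
    return maxout(z, k)

def init_params(time, freq, width=3, conv_out=8, hidden=16, classes=5, k=2, seed=0):
    """Random parameters; all sizes are hypothetical, chosen only for illustration."""
    rng = np.random.default_rng(seed)
    t_len = (time - width + 1) * (conv_out // k)
    f_len = (freq - width + 1) * (conv_out // k)
    return {
        "wt": rng.normal(0, 0.1, (width, freq, conv_out)), "bt": np.zeros(conv_out),
        "wf": rng.normal(0, 0.1, (width, time, conv_out)), "bf": np.zeros(conv_out),
        "w1": rng.normal(0, 0.1, (t_len + f_len, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0, 0.1, (hidden // k, classes)), "b2": np.zeros(classes),
    }

def tfcmnn_forward(spec, params, k=2):
    """Forward pass for one spectrogram patch of shape (time, freq)."""
    # Time-domain branch: kernels slide along time (assumption: freq bins are channels).
    t_feat = conv1d_maxout(spec, params["wt"], params["bt"], k)
    # Frequency-domain branch: kernels slide along frequency (time frames are channels).
    f_feat = conv1d_maxout(spec.T, params["wf"], params["bf"], k)
    # Concatenate both branches and classify with a fully connected maxout layer.
    h = np.concatenate([t_feat.ravel(), f_feat.ravel()])
    h = maxout(h @ params["w1"] + params["b1"], k)
    logits = h @ params["w2"] + params["b2"]
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()
```

Calling `tfcmnn_forward(spec, init_params(11, 13))` on an 11-frame, 13-bin patch returns a probability vector over the five hypothetical classes. The two branches share no weights and see the same input, which is the "simultaneously and independently" arrangement the abstract describes.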



Updated: 2022-08-27