Acoustic scene classification based on Mel spectrogram decomposition and model merging, Applied Acoustics


Acoustic scene classification based on Mel spectrogram decomposition and model merging
Applied Acoustics (IF 3.4), Pub Date: 2021-07-01, DOI: 10.1016/j.apacoust.2021.108258
Tao Zhang, Guoqing Feng, Jinhua Liang, Tong An

Convolutional Neural Networks (CNNs) with Mel spectrogram features have recently achieved excellent performance in Acoustic Scene Classification (ASC), and the Mel spectrogram is attracting increasing attention for its effectiveness in improving performance. In this paper, Gradient-weighted Class Activation Mapping (Grad-CAM), a CNN visualization technique, is used to evaluate what information a CNN perceives. The importance of different regions of the Mel spectrogram varies significantly for the trained CNN: some areas are strongly activated, while others are not. Because the whole Mel spectrogram contains a large amount of information, some of it has no effect when the entire spectrogram is fed into a CNN at once, which leaves room to improve feature utilization. This paper proposes a method based on spectrogram decomposition and model merging that makes local features more prominent and makes the CNN easier to train. Specifically, the whole Mel spectrogram is segmented along the time and frequency dimensions to generate multiple sub-spectrograms. Sub-spectrograms in the same frequency bins share the same CNN sub-model. The prediction for the whole Mel spectrogram is then obtained by merging the outputs of the CNN sub-models. Experimental results show that the proposed algorithm outperforms the existing systems by 5.64%. In addition, confusion matrices and class activation maps demonstrate the effectiveness of Mel spectrogram decomposition.
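
The decomposition and merging steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the grid size, the sub-model architecture, or the merging rule, so the 4 x 5 grid, the `StubSubModel` class (a random linear layer standing in for a CNN), and merging by averaging class probabilities are all assumptions made for illustration. The function names are hypothetical as well.

```python
import numpy as np


def split_spectrogram(mel, n_freq_bands, n_time_segments):
    """Split a (n_mels, n_frames) array into a grid of sub-spectrograms.

    Returns a list of lists: grid[i][j] is frequency band i, time segment j.
    """
    freq_bands = np.array_split(mel, n_freq_bands, axis=0)
    return [np.array_split(band, n_time_segments, axis=1) for band in freq_bands]


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


class StubSubModel:
    """Hypothetical placeholder for a CNN sub-model (assumption, not the paper's network)."""

    def __init__(self, input_size, n_classes, rng):
        self.weights = rng.normal(scale=0.01, size=(n_classes, input_size))

    def predict(self, sub_spec):
        # Flatten the sub-spectrogram and map it to class probabilities.
        return softmax(self.weights @ sub_spec.ravel())


def merge_predictions(mel, sub_models, n_time_segments):
    """Predict on every sub-spectrogram and average the class probabilities.

    One sub-model per frequency band: all time segments of a band share it.
    Averaging is an assumed merging rule; the abstract does not specify one.
    """
    grid = split_spectrogram(mel, len(sub_models), n_time_segments)
    probs = [
        model.predict(segment)
        for model, band in zip(sub_models, grid)
        for segment in band
    ]
    return np.mean(probs, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mel = rng.random((128, 400))  # synthetic stand-in for a Mel spectrogram
    models = [StubSubModel(32 * 80, 10, rng) for _ in range(4)]
    scene_probs = merge_predictions(mel, models, n_time_segments=5)
    print(scene_probs.shape, int(np.argmax(scene_probs)))
```

Sharing one sub-model across all time segments of a frequency band follows the abstract's statement that sub-spectrograms in the same frequency bins share the same CNN sub-model. In a real system each stub would be replaced by a trained CNN.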


Updated: 2021-07-01
