Towards cross-modal pre-training and learning tempo-spatial characteristics for audio recognition with convolutional and recurrent neural networks
EURASIP Journal on Audio, Speech, and Music Processing (IF 2.4) Pub Date: 2020-12-01, DOI: 10.1186/s13636-020-00186-0
Shahin Amiriparian, Maurice Gerczuk, Sandra Ottl, Lukas Stappen, Alice Baird, Lukas Koebe, Björn Schuller

In this paper, we investigate the performance of two deep learning paradigms for the audio-based tasks of acoustic scene, environmental sound, and domestic activity classification. In particular, a convolutional recurrent neural network (CRNN) and pre-trained convolutional neural networks (CNNs) are utilised. The CRNN is trained directly on Mel-spectrograms of the audio samples. For the pre-trained CNNs, the activations of one of the top layers of various architectures are extracted as feature vectors and used to train a linear support vector machine (SVM). Moreover, the predictions of the two models—the class probabilities predicted by the CRNN and the decision function of the SVM—are combined in a decision-level fusion to obtain the final prediction. For the pre-trained CNNs used as feature extractors, we further evaluate the effects of a range of configuration options, including the choice of the pre-training corpus. The system is evaluated on the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2017) workshop, on ESC-50, and on the multi-channel acoustic recordings of DCASE 2018, task 5. We refrain from additional data augmentation, as our primary goal is to analyse the general performance of the proposed system across different datasets. We show that our system achieves competitive performance on all datasets and demonstrate the complementarity of CRNNs and ImageNet pre-trained CNNs for acoustic classification tasks. We further find that in some cases, CNNs pre-trained on ImageNet can serve as more powerful feature extractors than AudioSet models. Finally, ImageNet pre-training is complementary to more domain-specific knowledge, whether in the form of the CRNN trained directly on the target data or of the AudioSet pre-trained models. In this regard, our findings indicate possible benefits of applying cross-modal pre-training of large CNNs to acoustic analysis tasks.
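The decision-level fusion described in the abstract combines two differently scaled outputs: the CRNN's class probabilities and the SVM's decision function. A minimal sketch of one plausible way to do this—normalising the SVM scores with a softmax and taking a weighted average—is shown below. The function names, the softmax normalisation, and the fusion weight are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the class axis.
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fuse_predictions(crnn_probs, svm_scores, weight=0.5):
    """Decision-level fusion of CRNN class probabilities and SVM decision
    scores (hypothetical scheme: softmax-normalise the SVM scores, then
    take a weighted average and pick the highest-scoring class).

    crnn_probs : (n_samples, n_classes) class probabilities from the CRNN
    svm_scores : (n_samples, n_classes) raw SVM decision-function values
    weight     : contribution of the CRNN (assumed, not from the paper)
    """
    svm_probs = softmax(svm_scores)
    fused = weight * crnn_probs + (1.0 - weight) * svm_probs
    return fused.argmax(axis=1)

# Toy example with two samples and three classes.
crnn = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])
svm = np.array([[2.0, 0.5, -1.0],
                [-0.5, 1.5, 1.0]])
print(fuse_predictions(crnn, svm))
```

In this toy case the two models agree on the first sample; on the second, the CRNN's confidence in the last class outweighs the SVM's preference after averaging. The equal weighting here is a placeholder—in practice the fusion weight would be tuned on a development set.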

Updated: 2020-12-01