Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification
arXiv - CS - Multimedia. Pub Date: 2021-07-28, DOI: arxiv-2107.13180
Javier Naranjo-Alcazar, Sergi Perez-Castanos, Aaron Lopez-Garcia, Pedro Zuccarello, Maximo Cobos, Francesc J. Ferri

The use of multiple, semantically correlated sources can provide complementary information that may not be evident when working with each modality on its own. In this context, multi-modal models can help produce more accurate and robust predictions in machine learning tasks where audio-visual data is available. This paper presents a multi-modal model for automatic scene classification that simultaneously exploits auditory and visual information. The proposed approach uses two separate networks that are trained in isolation on audio and visual data, respectively, so that each network specializes in its own modality. The visual subnetwork is a pre-trained VGG16 model followed by a bidirectional recurrent layer, while the residual audio subnetwork is based on stacked squeeze-excitation convolutional blocks trained from scratch. Once each subnetwork is trained, information from the audio and visual streams is fused at two different stages. The early fusion stage combines the features produced by the last convolutional block of each subnetwork at different time steps and feeds them to a bidirectional recurrent structure. The late fusion stage combines the output of the early fusion stage with the independent predictions provided by the two subnetworks to produce the final prediction. We evaluate the method on the recently published TAU Audio-Visual Urban Scenes 2021 dataset, which contains synchronized audio and video recordings from 12 European cities covering 10 different scene classes. In the evaluation results of the DCASE 2021 Challenge, the proposed model provides an excellent trade-off between prediction performance (86.5%) and system complexity (15M parameters).
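The following PyTorch code is a minimal illustrative sketch of the two-branch design described above, not the authors' released implementation. It shows the main ingredients: a residual squeeze-excitation convolutional block for the audio branch, a frozen VGG16 feature extractor for the video frames, an early-fusion bidirectional recurrent layer over concatenated per-time-step embeddings, and a late-fusion combination of the fused and per-branch predictions. All layer widths, the number of blocks, the time-axis alignment by interpolation, and the averaging rule for late fusion are assumptions made for this sketch; the actual DCASE 2021 system differs in its exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16


class SqueezeExcitation(nn.Module):
    """Channel-wise squeeze-excitation gate (Hu et al., 2018)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                 # squeeze: global average pooling
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                           # excitation: rescale channels


class SEConvBlock(nn.Module):
    """Residual convolutional block with a squeeze-excitation gate (audio branch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
        )
        self.se = SqueezeExcitation(out_ch)
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return torch.relu(self.se(self.conv(x)) + self.skip(x))


class AudioVisualSceneClassifier(nn.Module):
    """Two modality-specific branches with early (feature) and late (decision) fusion."""
    def __init__(self, num_classes=10, hidden=128):
        super().__init__()
        # Audio branch: stacked SE conv blocks over a log-mel spectrogram (trained from scratch).
        self.audio_cnn = nn.Sequential(
            SEConvBlock(1, 32), nn.MaxPool2d(2),
            SEConvBlock(32, 64), nn.MaxPool2d(2),
            SEConvBlock(64, 128), nn.AdaptiveAvgPool2d((1, None)),  # pool frequency, keep time axis
        )
        # Visual branch: frozen ImageNet-pretrained VGG16 convolutional features per frame.
        self.visual_cnn = vgg16(weights="IMAGENET1K_V1").features
        for p in self.visual_cnn.parameters():
            p.requires_grad = False
        self.visual_pool = nn.AdaptiveAvgPool2d(1)
        # Early fusion: concatenate per-time-step embeddings and run a bidirectional GRU.
        self.fusion_rnn = nn.GRU(128 + 512, hidden, batch_first=True, bidirectional=True)
        self.fusion_head = nn.Linear(2 * hidden, num_classes)
        # Per-branch heads, usable for independent pre-training and for late fusion.
        self.audio_head = nn.Linear(128, num_classes)
        self.visual_head = nn.Linear(512, num_classes)

    def forward(self, spec, frames):
        # spec:   (B, 1, n_mels, T)      log-mel spectrogram
        # frames: (B, T_v, 3, 224, 224)  sampled video frames
        a = self.audio_cnn(spec).squeeze(2).transpose(1, 2)          # (B, T_a, 128)
        B, Tv = frames.shape[:2]
        v = self.visual_cnn(frames.flatten(0, 1))                    # (B*T_v, 512, h, w)
        v = self.visual_pool(v).flatten(1).view(B, Tv, 512)          # (B, T_v, 512)
        # Align the audio time axis to the video frame rate (assumption: nearest interpolation).
        a = F.interpolate(a.transpose(1, 2), size=Tv).transpose(1, 2)
        early, _ = self.fusion_rnn(torch.cat([a, v], dim=-1))        # (B, T_v, 2*hidden)
        early_logits = self.fusion_head(early.mean(dim=1))
        # Late fusion: combine the fused prediction with each branch's own prediction.
        audio_logits = self.audio_head(a.mean(dim=1))
        visual_logits = self.visual_head(v.mean(dim=1))
        return (early_logits + audio_logits + visual_logits) / 3

In this sketch each branch keeps its own classification head, so the branches can first be trained independently, as described above, and the fusion layers fine-tuned afterwards; averaging the three logit vectors is just one common way to realise decision-level late fusion and stands in for whatever combination the actual system uses.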

Updated: 2021-07-29