Discriminative Cross-Modality Attention Network for Temporal Inconsistent Audio-Visual Event Localization
IEEE Transactions on Image Processing (IF 10.8), Pub Date: 2021-09-03, DOI: 10.1109/tip.2021.3106814
Hanyu Xuan, Lei Luo, Zhenyu Zhang, Jian Yang, Yan Yan

It is theoretically insufficient to construct a complete set of semantics in the real world using single-modality data. As a typical application of multi-modality perception, the audio-visual event localization task aims to match audio and visual components to identify simultaneous events of interest. Although some recent methods have been proposed for this task, they cannot handle the practical situation of temporal inconsistency that is widespread in audio-visual scenes. Inspired by the human perceptual system, which automatically filters out event-unrelated information when performing multi-modality perception, we propose a discriminative cross-modality attention network to simulate such a process. Mirroring this human mechanism, our network can adaptively select "where" to attend, "when" to attend and "which" to attend for audio-visual event localization. In addition, to prevent our network from collapsing to trivial solutions, a novel eigenvalue-based objective function is proposed to train the whole network to better fuse audio and visual signals, yielding a discriminative and nonlinear multi-modality representation. In this way, even with large temporal inconsistency between the audio and visual sequences, our network is able to adaptively select event-valuable information for audio-visual event localization. Furthermore, we systematically investigate three subtasks of audio-visual event localization, i.e., temporal localization, weakly-supervised spatial localization and cross-modality localization. The visualization results also help us better understand how our network works.
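The abstract does not spell out the attention formulation, but the core idea of letting each modality attend over the segments of the other so that temporally misaligned yet event-related content can still be matched can be illustrated with a minimal sketch. The module below is a hypothetical illustration only: the class name, feature dimensions, and the simple scaled dot-product attention are assumptions, not the authors' published implementation, and the eigenvalue-based objective is omitted.

```python
# Minimal, hypothetical sketch of cross-modality attention between audio and
# visual segment features. All names, dimensions, and the dot-product attention
# are assumptions for illustration; this is not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalityAttention(nn.Module):
    def __init__(self, dim_audio=128, dim_visual=512, dim_shared=256):
        super().__init__()
        # Project both modalities into a shared space before computing attention.
        self.audio_proj = nn.Linear(dim_audio, dim_shared)
        self.visual_proj = nn.Linear(dim_visual, dim_shared)

    def forward(self, audio, visual):
        # audio:  (batch, T, dim_audio)   -- one feature per audio segment
        # visual: (batch, T, dim_visual)  -- one pooled feature per visual segment
        a = self.audio_proj(audio)    # (batch, T, dim_shared)
        v = self.visual_proj(visual)  # (batch, T, dim_shared)

        # Cross-modal affinity: every audio segment scores every visual segment,
        # so temporally misaligned but event-related segments can still match.
        scores = torch.bmm(a, v.transpose(1, 2)) / a.size(-1) ** 0.5  # (batch, T, T)

        # Audio attends to visual content and vice versa ("where"/"when" to attend).
        audio_attended_visual = torch.bmm(F.softmax(scores, dim=-1), v)
        visual_attended_audio = torch.bmm(F.softmax(scores.transpose(1, 2), dim=-1), a)

        # Fuse each modality with the attended counterpart for event localization.
        return a + audio_attended_visual, v + visual_attended_audio


if __name__ == "__main__":
    att = CrossModalityAttention()
    audio = torch.randn(2, 10, 128)   # e.g., 10 one-second audio segments
    visual = torch.randn(2, 10, 512)  # the corresponding visual segments
    fa, fv = att(audio, visual)
    print(fa.shape, fv.shape)  # torch.Size([2, 10, 256]) torch.Size([2, 10, 256])
```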

Updated: 2021-09-03