Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization
arXiv - CS - Multimedia. Pub Date: 2021-06-27, DOI: arxiv-2106.14118
Anurag Bagchi, Jazib Mahmood, Dolton Fernandes, Ravi Kiran Sarvadevabhatla

State-of-the-art architectures for untrimmed video Temporal Action Localization (TAL) have only considered the RGB and Flow modalities, leaving the information-rich audio modality totally unexploited. Audio fusion has been explored for the related but arguably easier problem of trimmed (clip-level) action recognition. However, TAL poses a unique set of challenges. In this paper, we propose simple but effective fusion-based approaches for TAL. To the best of our knowledge, our work is the first to jointly consider audio and video modalities for supervised TAL. We experimentally show that our schemes consistently improve performance for state-of-the-art video-only TAL approaches. Specifically, they help achieve new state-of-the-art performance on the large-scale benchmark datasets ActivityNet-1.3 (52.73 mAP@0.5) and THUMOS14 (57.18 mAP@0.5). Our experiments include ablations involving multiple fusion schemes, modality combinations and TAL architectures. Our code, models and associated data will be made available.
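The abstract does not spell out the fusion schemes themselves, but a common feature-level variant is to concatenate temporally aligned per-snippet audio and video embeddings and project the result back to the video feature width, so that an existing video-only TAL head can be reused unchanged. The PyTorch sketch below illustrates that idea only; the ConcatFusion module, the 2048-d video / 128-d audio feature sizes (I3D- and VGGish-like, respectively), and the snippet-alignment assumption are illustrative choices, not the paper's exact method.

import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Fuse per-snippet video and audio features by concatenation,
    then project back to the video feature width so a video-only
    TAL head can consume the fused features (illustrative sketch)."""

    def __init__(self, video_dim: int = 2048, audio_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim + audio_dim, video_dim),
            nn.ReLU(),
        )

    def forward(self, video_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, T, video_dim); audio_feats: (batch, T, audio_dim)
        # The T snippets are assumed to be temporally aligned across modalities.
        fused = torch.cat([video_feats, audio_feats], dim=-1)
        return self.proj(fused)

if __name__ == "__main__":
    fusion = ConcatFusion()
    v = torch.randn(2, 100, 2048)  # 100 snippets of 2048-d video features
    a = torch.randn(2, 100, 128)   # aligned 128-d audio features per snippet
    out = fusion(v, a)
    print(out.shape)               # torch.Size([2, 100, 2048])

Projecting back to the original video feature width is a convenient design choice here because it keeps the downstream TAL architecture's input dimensions unchanged, which is one plausible way fusion could be bolted onto existing video-only pipelines.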

Updated: 2021-06-29