Transferable Knowledge-Based Multi-Granularity Fusion Network for Weakly Supervised Temporal Action Detection
IEEE Transactions on Multimedia ( IF 8.4 ) Pub Date : 2020-06-01 , DOI: 10.1109/tmm.2020.2999184
Haisheng Su , Xu Zhao , Tianwei Lin , Shuming Liu , Zhilan Hu

Despite remarkable progress, temporal action detection remains limited in real applications by the large amount of manual annotation it requires. This issue motivates interest in addressing the task under weak supervision, namely, locating action instances using only video-level class labels. Many current works on this task are based on the Class Activation Sequence (CAS), which is generated by a video classification network to describe the probability of each snippet belonging to a specific action class of the video. However, the CAS generated by a simple classification network focuses only on locally discriminative parts rather than locating the entire interval of target actions. In this paper, we present a novel framework to handle this issue. Specifically, we propose to utilize convolutional kernels with varied dilation rates to enlarge the receptive fields, which transfers discriminative information to the surrounding non-discriminative regions. Then, we design a cascaded module with the proposed Online Adversarial Erasing (OAE) mechanism to further mine regions relevant to target actions by feeding the feature maps, with the discovered regions erased, back into the system. In addition, inspired by transfer learning, we adopt an additional module that transfers knowledge from trimmed videos to untrimmed videos to promote classification performance on the latter. Finally, we employ a boundary regression module embedded with an Outer-Inner-Contrastive (OIC) loss to automatically predict boundaries based on the enhanced CAS. Extensive experiments on two challenging datasets, THUMOS14 and ActivityNet-1.3, clearly demonstrate the superiority of our unified framework.
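The multi-granularity idea above can be illustrated with a minimal numpy sketch (not the paper's implementation; function names, kernel shapes, and the mean-fusion rule are illustrative assumptions): a 1-D temporal convolution applied at several dilation rates sees progressively wider context, so activation spreads from the most discriminative snippets to their neighborhood.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """1-D temporal convolution with a given dilation rate.
    x: (T, C) snippet features; w: (K, C) kernel; returns a (T,) response.
    'same' zero-padding keeps the output aligned with the input."""
    T, C = x.shape
    K = w.shape[0]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros(T)
    for t in range(T):
        for k in range(K):
            # tap k looks dilation steps further away in time
            out[t] += np.dot(xp[t + k * dilation], w[k])
    return out

def multi_granularity_response(x, w, dilations=(1, 2, 4)):
    """Fuse responses from several dilation rates; larger rates cover a
    wider temporal receptive field, so the fused response activates
    around, not only at, the most discriminative snippets."""
    return np.mean([dilated_conv1d(x, w, d) for d in dilations], axis=0)
```

With a length-3 kernel, dilation rates 1, 2, and 4 give receptive fields of 3, 5, and 9 snippets respectively; averaging them is one simple fusion choice among several plausible ones.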
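The erasing mechanism can likewise be sketched in a few lines (a hypothetical simplification, not the paper's module; the thresholding rule and the `classify` callback are assumptions): snippets whose activation is already high are zeroed out of the feature map, so a subsequent classification pass must rely on, and thereby discover, less discriminative regions, and a cascade accumulates everything found.

```python
import numpy as np

def online_adversarial_erase(features, cas, threshold=0.5):
    """One erasing step: zero the features of snippets whose activation
    exceeds `threshold`. features: (T, C); cas: (T,) in [0, 1].
    Returns the erased features and the keep-mask."""
    mask = cas < threshold                  # True = snippet survives
    return features * mask[:, None], mask

def cascade(features, classify, steps=2, threshold=0.5):
    """Cascaded mining: repeatedly classify, record high-activation
    snippets, and erase them before the next pass.
    `classify` maps (T, C) features to a (T,) activation sequence."""
    discovered = np.zeros(len(features), dtype=bool)
    for _ in range(steps):
        cas = classify(features)
        found = cas >= threshold
        discovered |= found                 # accumulate mined regions
        features = features * (~found)[:, None]
    return discovered
```

In the full method the erased features are fed back through the trained classifier; here `classify` stands in for that network.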
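The Outer-Inner-Contrastive idea can be sketched as follows (a minimal illustration, not the paper's exact formulation; the inflation ratio and boundary handling are assumptions): a good interval should have high activation inside and low activation in a thin "outer" band around it, so the loss contrasts the two mean activations.

```python
import numpy as np

def oic_loss(cas, start, end, inflation=0.25):
    """Outer-Inner-Contrastive loss on a 1-D class activation sequence.
    cas: (T,) activations; [start, end) is the candidate interval.
    The outer area extends the interval on each side by `inflation`
    of its length; loss = mean(outer) - mean(inner), so minimizing it
    favors boundaries with strong inner and weak surrounding activation."""
    T = len(cas)
    length = end - start
    margin = max(1, int(round(inflation * length)))
    lo, hi = max(0, start - margin), min(T, end + margin)
    inner = cas[start:end]
    outer = np.concatenate([cas[lo:start], cas[end:hi]])
    outer_mean = outer.mean() if outer.size else 0.0
    return outer_mean - inner.mean()
```

A boundary that exactly covers a plateau of high activation yields a lower (better) loss than one shifted off the plateau, which is what lets the regression module refine boundaries on the enhanced CAS.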

Updated: 2020-06-01