Modeling Sub-Actions for Weakly Supervised Temporal Action Localization
IEEE Transactions on Image Processing (IF 10.8), Pub Date: 2021-05-13, DOI: 10.1109/tip.2021.3078324
Linjiang Huang , Yan Huang , Wanli Ouyang , Liang Wang

As a challenging task in high-level video understanding, weakly supervised temporal action localization has attracted increasing attention recently. Because only video-level category labels are available, the task is usually formulated as a classification problem, which suffers from the inherent contradiction between classification and detection. In this paper, we describe a novel approach that alleviates this contradiction and detects more complete action instances by explicitly modeling sub-actions. Our method relies on three innovations to model the latent sub-actions. First, our framework represents sub-actions with prototypes, which are learned automatically in an end-to-end manner. Second, we regard the relations among sub-actions as a graph and construct the correspondences between sub-actions and actions through a graph pooling operation. Doing so not only makes the sub-actions inter-dependent, which facilitates the multi-label setting, but also naturally uses the video-level labels as weak supervision. Third, we devise three complementary loss functions, namely a representation loss, a balance loss, and a relation loss, to ensure that the learned sub-actions are diverse and have clear semantic meanings. Experimental results on the THUMOS14 and ActivityNet1.3 datasets demonstrate the effectiveness of our method and its superior performance over state-of-the-art approaches.
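To make the general idea concrete, below is a minimal PyTorch sketch of prototype-based sub-action scoring supervised only by video-level labels. It is an illustrative assumption of how such a pipeline could look, not the authors' implementation: the class name, feature dimensions, the simple max grouping standing in for the paper's graph pooling, and the top-k temporal pooling are all hypothetical choices.

```python
import torch
import torch.nn as nn

class SubActionModel(nn.Module):
    """Hypothetical sketch: score snippets against learnable sub-action prototypes,
    then pool sub-action scores into action scores for video-level supervision."""
    def __init__(self, feat_dim=1024, num_actions=20, subs_per_action=3):
        super().__init__()
        self.num_actions = num_actions
        self.subs_per_action = subs_per_action
        num_prototypes = num_actions * subs_per_action
        # One learnable prototype vector per latent sub-action.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim) * 0.01)
        self.embed = nn.Linear(feat_dim, feat_dim)

    def forward(self, feats):
        # feats: (batch, time, feat_dim) snippet features from a pretrained backbone.
        x = nn.functional.normalize(self.embed(feats), dim=-1)
        p = nn.functional.normalize(self.prototypes, dim=-1)
        # Cosine similarity between every snippet and every sub-action prototype.
        sub_scores = torch.einsum('btd,kd->btk', x, p)        # (batch, time, K)
        # Group sub-action scores by their parent action; max over each action's
        # sub-actions is a crude stand-in for the graph pooling described above.
        b, t, _ = sub_scores.shape
        grouped = sub_scores.view(b, t, self.num_actions, self.subs_per_action)
        cas = grouped.max(dim=-1).values                      # (batch, time, num_actions)
        # Temporal top-k pooling yields video-level action scores.
        k = max(1, t // 8)
        video_logits = cas.topk(k, dim=1).values.mean(dim=1)  # (batch, num_actions)
        return sub_scores, cas, video_logits

# Multi-hot video-level labels supervise the model with a binary cross-entropy loss.
model = SubActionModel()
feats = torch.randn(2, 100, 1024)
labels = torch.zeros(2, 20)
labels[0, 3] = 1
labels[1, [5, 7]] = 1
_, _, video_logits = model(feats)
loss = nn.functional.binary_cross_entropy_with_logits(video_logits, labels)
loss.backward()
```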

Updated: 2021-05-13