Two-Branch Relational Prototypical Network for Weakly Supervised Temporal Action Localization
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8), Pub Date: 2021-04-28, DOI: 10.1109/tpami.2021.3076172
Linjiang Huang, Yan Huang, Wanli Ouyang, Liang Wang

As a challenging task in high-level video understanding, weakly supervised temporal action localization has attracted growing attention. Given only video-level category labels, the task must identify both the background and the action categories frame by frame. This is non-trivial in untrimmed videos, because the background is unconstrained and the actions are complex and multi-label. Observing that these difficulties stem mainly from the large variations within background and actions, we address them from the perspective of modeling variations. It is further desirable to reduce these variations, that is, to learn compact features, so that background identification can be cast as background rejection and the contradiction between classification and detection is alleviated. Accordingly, we propose a two-branch relational prototypical network. The first branch, the action-branch, adopts class-wise prototypes and acts mainly as an auxiliary that introduces prior knowledge about label dependencies and guides the second branch. The second branch, the sub-branch (s-branch), starts with multiple prototypes, called sub-prototypes, which give it a strong ability to model variations. As a further benefit, we design a multi-label clustering loss based on the sub-prototypes to learn compact features under the multi-label setting. The two branches are associated through the correspondences between the two types of prototypes, which leads to a special two-stage classifier in the s-branch. In addition, the two branches serve as regularization terms for each other, improving the final performance. Ablation studies show that the proposed model is capable of modeling classes with large variations and learning compact features. Extensive experimental evaluations on the THUMOS14, MultiTHUMOS and ActivityNet datasets demonstrate the effectiveness of the proposed method and its superior performance over state-of-the-art approaches.
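The idea of scoring frames against several sub-prototypes per class, and then rejecting background, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the aggregation rule, or the pooling, so cosine similarity, a max over sub-prototypes, top-k temporal averaging, and a fixed rejection threshold are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def sub_prototype_scores(frames, sub_prototypes):
    """Cosine similarity of every frame to every sub-prototype.

    frames:         (T, D) per-frame features
    sub_prototypes: (C, K, D), K sub-prototypes for each of C classes
    returns:        (T, C, K)
    """
    f = l2_normalize(frames)
    p = l2_normalize(sub_prototypes)
    return np.einsum("td,ckd->tck", f, p)


def class_scores(frames, sub_prototypes):
    """Frame-level class score: the best-matching sub-prototype per class.

    Taking the max over K is an assumption, not the paper's stated rule.
    """
    return sub_prototype_scores(frames, sub_prototypes).max(axis=-1)  # (T, C)


def video_scores(frame_class_scores, top_k):
    """Video-level score per class: mean of the top_k frame scores (assumed pooling)."""
    top = np.sort(frame_class_scores, axis=0)[-top_k:]
    return top.mean(axis=0)  # (C,)


def reject_background(frame_class_scores, threshold):
    """Mark a frame as background when no class scores above `threshold` (assumed rule)."""
    return frame_class_scores.max(axis=1) < threshold  # (T,) boolean mask


# Toy usage with random features: 20 frames, 3 classes, 4 sub-prototypes, 16 dims.
rng = np.random.default_rng(0)
frames = rng.normal(size=(20, 16))
sub_prototypes = rng.normal(size=(3, 4, 16))

frame_scores = class_scores(frames, sub_prototypes)
print(frame_scores.shape)                      # (20, 3)
print(video_scores(frame_scores, top_k=5).shape)  # (3,)
print(reject_background(frame_scores, threshold=0.5).shape)  # (20,)
```

With compact features, frames of one class cluster tightly around its sub-prototypes, which is what would make a simple similarity threshold of this kind plausible as a background-rejection rule.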

Updated: 2021-04-28