Two-Branch Relational Prototypical Network for Weakly Supervised Temporal Action Localization.
IEEE Transactions on Pattern Analysis and Machine Intelligence ( IF 20.8 ) Pub Date : 2021-04-28 , DOI: 10.1109/tpami.2021.3076172
Linjiang Huang , Yan Huang , Wanli Ouyang , Liang Wang

As a challenging task in high-level video understanding, weakly supervised temporal action localization has attracted increasing attention recently. With only video-level category labels, this task must identify background and actions frame by frame; however, this is non-trivial due to the unconstrained background and the complex, multi-label nature of actions. Observing that these difficulties mainly stem from the large variations within both background and actions, we propose to address these challenges from the perspective of modeling variations. Moreover, further reducing these variations is desirable, as it allows background identification to be cast as background rejection and alleviates the contradiction between classification and detection. Accordingly, in this paper, we propose a two-branch relational prototypical network. The first branch, namely the action-branch, adopts class-wise prototypes and mainly serves as an auxiliary that introduces prior knowledge about label dependencies. Meanwhile, the second branch, the sub-branch, employs multiple prototypes, namely sub-prototypes, to provide a strong ability to model variations. As a further benefit, we carefully design a multi-label clustering loss based on the sub-prototypes to learn compact features under the multi-label setting. Extensive experiments on three datasets demonstrate the effectiveness of the proposed method and its superior performance over state-of-the-art methods.
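The abstract describes the two-branch design only at a high level. The following is a minimal sketch of the prototype-based scoring idea it suggests: class-wise prototypes in an action-branch and multiple sub-prototypes per class in a sub-branch, with a weakly supervised video-level prediction. The feature dimension, cosine-similarity scoring, number of sub-prototypes, and top-k pooling are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of prototype-based frame scoring under weak supervision.
# All hyperparameters and the similarity/pooling choices below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchPrototypeScorer(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=20, num_sub=4):
        super().__init__()
        # Action-branch: one prototype per action class (plus background).
        self.class_protos = nn.Parameter(torch.randn(num_classes + 1, feat_dim))
        # Sub-branch: several sub-prototypes per class to capture intra-class variation.
        self.sub_protos = nn.Parameter(torch.randn(num_classes + 1, num_sub, feat_dim))

    def forward(self, feats):
        # feats: (B, T, D) snippet features from a pretrained backbone.
        f = F.normalize(feats, dim=-1)

        # Action-branch scores: cosine similarity to each class prototype -> (B, T, C+1).
        cls_scores = f @ F.normalize(self.class_protos, dim=-1).t()

        # Sub-branch scores: similarity to every sub-prototype, then keep the
        # best-matching sub-prototype per class -> (B, T, C+1).
        sp = F.normalize(self.sub_protos, dim=-1)           # (C+1, K, D)
        sub_sim = torch.einsum('btd,ckd->btck', f, sp)      # (B, T, C+1, K)
        sub_scores = sub_sim.max(dim=-1).values

        # Video-level prediction with only video-level labels: average the
        # top-k frame scores per class (a common multiple-instance pooling choice).
        k = max(1, feats.shape[1] // 8)
        video_logits = cls_scores.topk(k, dim=1).values.mean(dim=1)
        return cls_scores, sub_scores, video_logits

# Example usage with random features standing in for a two-video batch.
scorer = TwoBranchPrototypeScorer()
frame_feats = torch.randn(2, 64, 2048)
cls_s, sub_s, vid_logits = scorer(frame_feats)
print(cls_s.shape, sub_s.shape, vid_logits.shape)  # (2, 64, 21) (2, 64, 21) (2, 21)
```

In this reading, the video-level logits would be trained against the video-level category labels, while the frame-level scores serve localization; the multi-label clustering loss mentioned in the abstract would additionally pull frame features toward their nearest sub-prototypes, but its exact form is not specified here.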

Updated: 2021-04-28