Pointly-Supervised Action Localization, International Journal of Computer Vision



Pointly-Supervised Action Localization
International Journal of Computer Vision (IF 11.6), Pub Date: 2018-09-11, DOI: 10.1007/s11263-018-1120-4
Pascal Mettes, Cees G. M. Snoek

This paper strives for spatio-temporal localization of human actions in videos. In the literature, the consensus is to achieve localization by training on bounding box annotations provided for each frame of each training video. As annotating boxes in video is expensive, cumbersome and error-prone, we propose to bypass box-supervision. Instead, we introduce action localization based on point-supervision. We start from unsupervised spatio-temporal proposals, which provide a set of candidate regions in videos. While normally used exclusively for inference, we show spatio-temporal proposals can also be leveraged during training when guided by a sparse set of point annotations. We introduce an overlap measure between points and spatio-temporal proposals and incorporate them all into a new objective of a multiple instance learning optimization. During inference, we introduce pseudo-points, visual cues from videos, that automatically guide the selection of spatio-temporal proposals. We outline five spatial and one temporal pseudo-point, as well as a measure to best leverage pseudo-points at test time. Experimental evaluation on three action localization datasets shows our pointly-supervised approach (1) is as effective as traditional box-supervision at a fraction of the annotation cost, (2) is robust to sparse and noisy point annotations, (3) benefits from pseudo-points during inference, and (4) outperforms recent weakly-supervised alternatives. This leads us to conclude that points provide a viable alternative to boxes for action localization.
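The abstract does not give the formula for the overlap measure between points and spatio-temporal proposals, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a proposal is a set of per-frame boxes, that point annotations exist for a sparse subset of frames, and that a point scores higher the closer it lies to the box center. The names `Proposal`, `point_box_score`, `point_overlap`, and `select_proposal` are hypothetical and are not taken from the paper; the selection step stands in for the role a multiple instance learning objective might give to such a measure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Point = Tuple[float, float]              # (x, y)


@dataclass
class Proposal:
    """A spatio-temporal proposal: one box per frame index it covers."""
    boxes: Dict[int, Box]


def point_box_score(point: Point, box: Box) -> float:
    """Hypothetical per-frame score (an assumption, not the paper's measure).

    Returns 0.0 if the point lies outside the box, otherwise a value in
    [0.5, 1.0] that is highest when the point sits at the box center.
    """
    x, y = point
    x1, y1, x2, y2 = box
    if not (x1 <= x <= x2 and y1 <= y <= y2):
        return 0.0
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = max((x2 - x1) / 2.0, 1e-9)
    half_h = max((y2 - y1) / 2.0, 1e-9)
    # Normalized Chebyshev distance from the box center, in [0, 1].
    dist = max(abs(x - cx) / half_w, abs(y - cy) / half_h)
    return 1.0 - 0.5 * dist


def point_overlap(proposal: Proposal, points: Dict[int, Point]) -> float:
    """Average the per-frame score over all annotated frames.

    Annotated frames that the proposal does not cover contribute zero,
    so proposals missing part of the action are penalized.
    """
    if not points:
        return 0.0
    total = 0.0
    for frame, point in points.items():
        box = proposal.boxes.get(frame)
        if box is not None:
            total += point_box_score(point, box)
    return total / len(points)


def select_proposal(proposals: List[Proposal], points: Dict[int, Point]) -> int:
    """Return the index of the proposal with the highest point overlap."""
    scores = [point_overlap(p, points) for p in proposals]
    return max(range(len(proposals)), key=scores.__getitem__)
```

In this sketch, a proposal whose boxes are centered on every annotated point scores 1.0, a proposal covering only half of the annotated frames scores at most 0.5, and a proposal that misses all points scores 0.0. A training loop could then treat the top-scoring proposal per video as the positive instance, though how the paper combines the measure with its objective is not stated in the abstract.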


Updated: 2018-09-11
