Unsupervised Visual Representation Learning by Tracking Patches in Video, arXiv - CS - Computer Vision and Pattern Recognition



arXiv - CS - Computer Vision and Pattern Recognition, Pub Date: 2021-05-06, DOI: arxiv-2105.02545
Guangting Wang, Yizhou Zhou, Chong Luo, Wenxuan Xie, Wenjun Zeng, Zhiwei Xiong

Inspired by the fact that human eyes continue to develop tracking ability in early and middle childhood, we propose to use tracking as a proxy task for a computer vision system to learn visual representations. Modelled on the Catch game played by children, we design a Catch-the-Patch (CtP) game for a 3D-CNN model to learn visual representations that help with video-related tasks. In the proposed pretraining framework, we cut an image patch from a given video and let it scale and move according to a pre-set trajectory. The proxy task is to estimate the position and size of the image patch in a sequence of video frames, given only the target bounding box in the first frame. We find that using multiple image patches simultaneously brings clear benefits. We further increase the difficulty of the game by randomly making patches invisible. Extensive experiments on mainstream benchmarks demonstrate the superior performance of CtP over other video pretraining methods. In addition, CtP-pretrained features are less sensitive to domain gaps than those trained by a supervised action recognition task. When both are trained on Kinetics-400, we are pleasantly surprised to find that the CtP-pretrained representation achieves much higher action classification accuracy than its fully supervised counterpart on the Something-Something dataset. Code is available online: github.com/microsoft/CtP.
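
The data-generation side of the proxy task described above can be illustrated with a minimal sketch. This is not the authors' implementation (see github.com/microsoft/CtP for that); it only assumes what the abstract states: a patch is cut from the video, moved and scaled along a pre-set trajectory, optionally made invisible in some frames, and the per-frame boxes serve as regression targets. Everything else is a hypothetical choice made for illustration: the function names, the linear trajectory, cutting the patch from the first frame, nearest-neighbour resizing, the `hide_prob` parameter, and keeping target boxes for frames where the patch is hidden. Frames are plain 2-D lists of grey values, and the first box is assumed to lie fully inside the frame.

```python
import random


def make_trajectory(num_frames, start_box, end_box):
    """Linearly interpolate (cx, cy, w, h) boxes from start_box to end_box.

    A linear path is an assumption; the abstract only says "pre-set trajectory".
    """
    boxes = []
    for t in range(num_frames):
        a = t / (num_frames - 1) if num_frames > 1 else 0.0
        boxes.append(tuple((1 - a) * s + a * e for s, e in zip(start_box, end_box)))
    return boxes


def paste_patch(frame, patch, box):
    """Paste `patch` into `frame` in place, resized to `box` (nearest neighbour)."""
    cx, cy, w, h = box
    w, h = max(1, round(w)), max(1, round(h))
    x0, y0 = round(cx - w / 2), round(cy - h / 2)
    ph, pw = len(patch), len(patch[0])
    for dy in range(h):
        for dx in range(w):
            y, x = y0 + dy, x0 + dx
            # Silently clip pixels that fall outside the frame.
            if 0 <= y < len(frame) and 0 <= x < len(frame[0]):
                frame[y][x] = patch[dy * ph // h][dx * pw // w]


def catch_the_patch_sample(video, first_box, last_box, rng, hide_prob=0.0):
    """Return (augmented copy of video, list of per-frame target boxes)."""
    cx, cy, w, h = first_box
    x0, y0 = round(cx - w / 2), round(cy - h / 2)
    # Assumption: the patch is cut from the first frame at first_box.
    patch = [row[x0:x0 + round(w)] for row in video[0][y0:y0 + round(h)]]
    boxes = make_trajectory(len(video), first_box, last_box)
    out = [[row[:] for row in frame] for frame in video]  # leave input untouched
    for t, box in enumerate(boxes):
        # The first frame always shows the patch, since its box is the given cue.
        if t == 0 or rng.random() >= hide_prob:
            paste_patch(out[t], patch, box)
    return out, boxes
```

A model trained on such samples would receive the augmented clip plus `boxes[0]` and be asked to regress the remaining boxes. Using several patches at once, which the abstract reports as beneficial, would amount to calling the pasting step once per patch with independent trajectories.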


Updated: 2021-05-07
