TCLR: Temporal Contrastive Learning for Video Representation
arXiv - CS - Computer Vision and Pattern Recognition Pub Date : 2021-01-20 , DOI: arxiv-2101.07974
Ishan Dave, Rohit Gupta, Mamshad Nayeem Rizve, Mubarak Shah

Contrastive learning has nearly closed the gap between supervised and self-supervised learning of image representations. Existing extensions of contrastive learning to the video domain, however, rely on a naive transposition of ideas from image-based methods and do not fully exploit the temporal dimension present in video. We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods. The first loss adds the task of discriminating between non-overlapping clips from the same video, whereas the second loss aims to discriminate between timesteps of the feature map of an input clip in order to increase the temporal diversity of the learned features. Temporal contrastive learning achieves significant improvements over state-of-the-art results on downstream video understanding tasks such as action recognition, limited-label action classification, and nearest-neighbor video retrieval, across multiple video datasets and 3D CNN architectures. With the commonly used 3D-ResNet-18 architecture, we achieve 82.4% top-1 accuracy on UCF101 action classification (+5.1% over the previous best), 52.9% on HMDB51 (+5.4%), and 56.2% top-1 recall on UCF101 nearest-neighbor video retrieval (+11.7%).
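
The following is a minimal sketch in PyTorch of the two losses described above, written purely for illustration: it is not the authors' released code, and the function names, tensor shapes, and temperature value are assumptions made for this example. The first function treats the other non-overlapping clips of the same video (and clips from other videos) as negatives in an NT-Xent-style objective; the second matches corresponding timesteps of two augmented views of the same clip, so that different timesteps remain distinguishable.

import torch
import torch.nn.functional as F


def clip_level_temporal_loss(view1, view2, temperature=0.1):
    """First idea (illustrative): discriminate between non-overlapping clips of the same video.

    view1, view2: (B, T, D) embeddings of T non-overlapping clips per video under two
    different augmentations. The positive for clip (b, t) in view1 is clip (b, t) in view2;
    the other T-1 clips of the SAME video, and all clips of other videos, act as negatives,
    so temporally distinct clips are pushed apart instead of collapsed together.
    """
    B, T, D = view1.shape
    z1 = F.normalize(view1.reshape(B * T, D), dim=1)
    z2 = F.normalize(view2.reshape(B * T, D), dim=1)
    logits = z1 @ z2.t() / temperature            # (B*T, B*T) cosine similarities
    targets = torch.arange(B * T, device=z1.device)
    return F.cross_entropy(logits, targets)


def timestep_diversity_loss(feature_map1, feature_map2, temperature=0.1):
    """Second idea (illustrative): discriminate between timesteps of a clip's feature map.

    feature_map1, feature_map2: (B, C, T') spatially pooled feature maps of the same clip
    under two augmentations. Timestep t in one view is matched only to timestep t in the
    other view, which encourages temporally diverse, non-redundant features.
    """
    B, C, Tp = feature_map1.shape
    losses = []
    for b in range(B):
        f1 = F.normalize(feature_map1[b].t(), dim=1)   # (T', C)
        f2 = F.normalize(feature_map2[b].t(), dim=1)
        logits = f1 @ f2.t() / temperature             # (T', T')
        targets = torch.arange(Tp, device=f1.device)
        losses.append(F.cross_entropy(logits, targets))
    return torch.stack(losses).mean()


if __name__ == "__main__":
    # Toy shapes: 2 videos, 4 non-overlapping clips, 128-dim clip embeddings,
    # and feature maps with 256 channels and 4 temporal steps.
    v1, v2 = torch.randn(2, 4, 128), torch.randn(2, 4, 128)
    f1, f2 = torch.randn(2, 256, 4), torch.randn(2, 256, 4)
    total = clip_level_temporal_loss(v1, v2) + timestep_diversity_loss(f1, f2)
    print(total.item())

In practice such losses would be combined with a standard instance-discrimination term and applied to embeddings produced by a 3D CNN backbone; the exact weighting and sampling scheme here are not specified by the abstract and are left out of the sketch.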

Updated: 2021-01-21