Temporal Contrastive Graph for Self-supervised Video Representation Learning
arXiv - CS - Multimedia Pub Date : 2021-01-04 , DOI: arxiv-2101.00820
Yang Liu, Keze Wang, Haoyuan Lan, Liang Lin

Attempting to fully explore the fine-grained temporal structure and global-local chronological characteristics of videos for self-supervised video representation learning, this work takes a closer look at exploiting the temporal structure of videos and proposes a novel self-supervised method named Temporal Contrastive Graph (TCG). In contrast to existing methods that randomly shuffle the video frames or video snippets within a video, our proposed TCG is rooted in a hybrid graph contrastive learning strategy that regards the inter-snippet and intra-snippet temporal relationships as self-supervision signals for temporal representation learning. Inspired by neuroscience studies showing that the human visual system is sensitive to both local and global temporal changes, our proposed TCG integrates prior knowledge about frame and snippet orders into temporal contrastive graph structures, i.e., the intra-/inter-snippet temporal contrastive graph modules, to preserve the local and global temporal relationships among video frame-sets and snippets. By randomly removing edges and masking node features of the intra-snippet or inter-snippet graphs, our TCG can generate different correlated graph views. Specific contrastive losses are then designed to maximize the agreement between node embeddings in different views. To learn the global context representation and recalibrate the channel-wise features adaptively, we introduce an adaptive video snippet order prediction module, which leverages the relational knowledge among video snippets to predict the actual snippet orders. Extensive experimental results demonstrate the superiority of our TCG over state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.
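To make the graph contrastive step concrete, here is a minimal PyTorch sketch (not the authors' released code) of the view-generation and agreement-maximization idea described above: two correlated views of a snippet graph are produced by randomly dropping edges and masking node features, and an NT-Xent-style loss pulls matching node embeddings together across views. The helper names (drop_edges, mask_features, nt_xent) and the stand-in encoder are assumptions for illustration.

```python
# Illustrative sketch of graph-view generation + node-level contrast;
# hyperparameters and helper names are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def drop_edges(edge_index: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    """Randomly remove a fraction p of edges (edge_index: [2, E])."""
    keep = torch.rand(edge_index.size(1)) >= p
    return edge_index[:, keep]

def mask_features(x: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    """Zero out a random subset of feature dimensions for all nodes."""
    mask = (torch.rand(x.size(1), device=x.device) >= p).float()
    return x * mask

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Maximize agreement: node i in view 1 should match node i in view 2."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                         # [N, N] similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on diagonal
    return F.cross_entropy(logits, targets)

# Usage with a stand-in GNN encoder `gnn(x, edge_index) -> [N, D]`:
# x, edge_index = snippet_graph()                 # node features, edges
# v1 = gnn(mask_features(x), drop_edges(edge_index))
# v2 = gnn(mask_features(x), drop_edges(edge_index))
# loss = nt_xent(v1, v2)
```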
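The snippet order prediction module could likewise be sketched as a small head that recalibrates channel-wise snippet features (a squeeze-and-excitation-style gate, assumed here) and pools pairwise relations into a permutation classifier; the layer sizes and exact design below are assumptions, not the paper's specification.

```python
# Hedged sketch of an order-prediction head in the spirit of the module
# described above; the SE gate and relation MLP are illustrative choices.
import itertools
import torch
import torch.nn as nn

class SnippetOrderHead(nn.Module):
    def __init__(self, dim: int, n_snippets: int = 3):
        super().__init__()
        self.n_perms = len(list(itertools.permutations(range(n_snippets))))
        # Channel-wise recalibration gate (squeeze-and-excitation style).
        self.se = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                                nn.Linear(dim // 4, dim), nn.Sigmoid())
        # Pairwise relation MLP and classifier over snippet permutations.
        self.rel = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        n_pairs = n_snippets * (n_snippets - 1) // 2
        self.cls = nn.Linear(n_pairs * dim, self.n_perms)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: [B, n_snippets, dim], features of a shuffled clip.
        feats = feats * self.se(feats)             # recalibrate channels
        pairs = [self.rel(torch.cat([feats[:, i], feats[:, j]], dim=-1))
                 for i, j in itertools.combinations(range(feats.size(1)), 2)]
        return self.cls(torch.cat(pairs, dim=-1))  # logits over orderings
```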

Updated: 2021-01-05