Temporal Contrastive Graph for Self-supervised Video Representation Learning
arXiv - CS - Multimedia. Pub Date: 2021-01-04. DOI: arxiv-2101.00820. Yang Liu, Keze Wang, Haoyuan Lan, Liang Lin
Attempting to fully explore the fine-grained temporal structure and global-local
chronological characteristics for self-supervised video representation
learning, this work takes a closer look at exploiting the temporal structure of
videos and proposes a novel self-supervised method named Temporal Contrastive
Graph (TCG). In contrast to existing methods that randomly shuffle the video
frames or video snippets within a video, our proposed TCG is rooted in a hybrid
graph contrastive learning strategy that regards the inter-snippet and
intra-snippet temporal relationships as self-supervision signals for temporal
representation learning. Inspired by neuroscience studies showing that the
human visual system is sensitive to both local and global temporal changes, our
proposed TCG integrates prior knowledge about frame and snippet orders into
temporal contrastive graph structures, i.e., the intra-/inter-snippet temporal
contrastive graph modules, to preserve the local and global temporal
relationships among video frame-sets and snippets. By randomly removing edges
and masking node features of the intra-snippet or inter-snippet graphs, our TCG
can generate different correlated graph views. Specific contrastive losses are
then designed to maximize the agreement between node embeddings across these
views. To learn the global context representation and recalibrate the
channel-wise features adaptively, we introduce an adaptive video snippet order
prediction module, which leverages the relational knowledge among video
snippets to predict the actual snippet orders. Extensive experimental results
demonstrate the superiority of our TCG over state-of-the-art methods on
large-scale action recognition and video retrieval benchmarks.
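The view-generation and agreement steps above can be sketched in a few lines: random edge removal and node-feature masking produce two correlated views of a snippet graph, and an InfoNCE-style loss pulls each node's embedding toward the same node in the other view. This is a minimal illustrative sketch, not the paper's exact formulation: the function names, the drop/mask rates, the toy graph, and the use of raw features in place of GNN outputs are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_graph(adj, feats, edge_drop=0.2, feat_mask=0.2):
    """Generate one correlated view: randomly remove edges and
    mask node-feature dimensions (rates are illustrative)."""
    # Drop edges on the upper triangle, then mirror to keep symmetry.
    keep = rng.random(adj.shape) > edge_drop
    keep = np.triu(keep, 1)
    adj_view = adj * (keep + keep.T)
    # Zero out a random subset of feature dimensions per node.
    fmask = rng.random(feats.shape) > feat_mask
    return adj_view, feats * fmask

def agreement_loss(z1, z2, tau=0.5):
    """InfoNCE-style loss: node i in view 1 should agree with
    node i in view 2 (positives on the diagonal)."""
    z1 = z1 / (np.linalg.norm(z1, axis=1, keepdims=True) + 1e-8)
    z2 = z2 / (np.linalg.norm(z2, axis=1, keepdims=True) + 1e-8)
    sim = z1 @ z2.T / tau                        # (N, N) similarities
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy intra-snippet graph: 4 frame nodes with 8-dim features.
adj = np.ones((4, 4)) - np.eye(4)
feats = rng.standard_normal((4, 8))
a1, f1 = augment_graph(adj, feats)
a2, f2 = augment_graph(adj, feats)
loss = agreement_loss(f1, f2)  # raw features stand in for GNN embeddings
```

In the actual method the two views would be encoded by graph modules before the loss is computed; here the masked features stand in for those embeddings only to keep the sketch self-contained.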
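The snippet order prediction task can be framed as permutation classification: shuffle the snippet-level features with a known permutation and ask a model to recover which ordering was applied. A minimal sketch of the data side of that task, assuming 3 snippets and hence a 3! = 6-way classification; the shapes, names, and the row-matching sanity check are hypothetical and not from the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# All orderings of 3 snippets; order prediction = 6-way classification.
PERMS = list(itertools.permutations(range(3)))

def make_order_sample(snippets):
    """Shuffle snippet features with a random permutation and
    return (shuffled features, permutation label)."""
    label = int(rng.integers(len(PERMS)))
    perm = PERMS[label]
    return snippets[list(perm)], label

snippets = rng.standard_normal((3, 8))  # 3 snippet-level features, 8-dim
shuffled, label = make_order_sample(snippets)

# Sanity check: recover the permutation by matching rows to originals.
recovered = tuple(
    int(np.argmax(np.all(np.isclose(snippets, row), axis=1)))
    for row in shuffled
)
```

A prediction head would consume `shuffled` (after relational reasoning across snippets) and be trained with cross-entropy against `label`; the row-matching here only verifies the labeling is consistent.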
Updated: 2021-01-05