Self-supervised video representation learning by maximizing mutual information
Signal Processing: Image Communication ( IF 3.5 ) Pub Date : 2020-08-12 , DOI: 10.1016/j.image.2020.115967
Fei Xue , Hongbing Ji , Wenbo Zhang , Yi Cao

We address the problem of learning representations from videos without manual annotation. Different video clips sampled from the same video usually share a similar background and consistent motion. We design a novel self-supervised task to learn this temporal coherence, which we measure using mutual information. First, we maximize the mutual information between features extracted from clips sampled from the same video. This encourages the network to learn the content shared by these clips. However, because different clips from the same video normally have the same background, the network may focus on the background and ignore the motion. Second, to address this issue, we simultaneously maximize the mutual information between the feature of the video clip and the local regions where salient motion exists. Our approach, referred to as Deep Video Infomax (DVIM), strikes a balance between background and motion when learning temporal coherence. We conduct extensive experiments to test the performance of the proposed DVIM on various tasks. Fine-tuning results on high-level action recognition problems validate the effectiveness of the learned representations, and additional experiments on action similarity labeling demonstrate their generalization ability.
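The abstract does not specify which mutual-information estimator DVIM uses. A common choice for maximizing a lower bound on the mutual information between paired features (here, two clips from the same video) is an InfoNCE-style contrastive objective, where other videos in the batch act as negatives. Below is a minimal NumPy sketch of that idea; the function name and all details are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def infonce_lower_bound(anchors, positives, temperature=0.1):
    """InfoNCE-style lower bound on I(anchor; positive).

    anchors, positives: (N, D) feature arrays; row i of each is assumed
    to come from two clips of the same video (a positive pair), while
    all other rows in the batch serve as negatives.
    """
    # L2-normalize so pairwise scores are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (N, N) similarity matrix

    # Cross-entropy with the diagonal entries as the positive pairs
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nce_loss = -np.mean(np.diag(log_softmax))

    # log(N) - loss lower-bounds the mutual information (Oord et al.'s bound)
    return np.log(len(anchors)) - nce_loss
```

Minimizing `nce_loss` (equivalently, maximizing the returned bound) pushes features of clips from the same video together and those from different videos apart, which is the "shared content" objective the abstract describes; the second DVIM term would apply an analogous bound between a clip feature and local motion-salient regions.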




Updated: 2020-08-18