Mutual information guided 3D ResNet for self-supervised video representation learning
IET Image Processing ( IF 2.3 ) Pub Date : 2020-11-30 , DOI: 10.1049/iet-ipr.2020.0019
Fei Xue 1 , Hongbing Ji 1 , Wenbo Zhang 1

In this work, the authors propose a novel self-supervised learning method based on mutual information to learn representations from videos without manual annotation. Different video clips sampled from the same video usually exhibit coherence in the temporal domain. To guide the network to learn this temporal coherence, they maximise the mutual information between global features extracted from different clips sampled from the same video (Global-MI). However, maximising the Global-MI alone drives the network to seek content shared across clips, which may cause it to degenerate and focus on the video background. Considering the structure of the video, they therefore also maximise the average mutual information between the global feature and local patches from multiple regions of the video clip (multi-region Local-MI). Their approach, called Max-GL, learns temporal coherence by jointly maximising the Global-MI and the multi-region Local-MI. Experiments validate the effectiveness of the proposed Max-GL: the results show that Max-GL serves as an effective pre-training method for action recognition in videos. Additional experiments on action similarity labelling and dynamic scene recognition further validate the generalisation of the representations learned by Max-GL.
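The abstract does not specify how the Global-MI term is estimated. A common choice for maximising mutual information between paired features is an InfoNCE-style lower bound, where clips from the same video act as positive pairs and all other clips in the batch act as negatives. The sketch below is a minimal plain-Python illustration of that idea; the function name, dot-product similarity, and batch construction are assumptions for illustration, not the authors' implementation.

```python
import math

def dot(u, v):
    """Dot-product similarity between two feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def infonce_global_mi(anchors, positives):
    """InfoNCE-style lower bound on the mutual information between
    global features of clip pairs (a sketch of the Global-MI idea).

    anchors[i] and positives[i] are global features of two clips
    sampled from the same video; every other pair (i, j != i) in the
    batch serves as a negative. Returns log(n) plus the average
    log-softmax score of the positive pairs, which lower-bounds MI.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        scores = [dot(anchors[i], p) for p in positives]
        # log-softmax of the positive pair's score within the batch
        log_softmax = scores[i] - math.log(sum(math.exp(s) for s in scores))
        total += log_softmax
    return math.log(n) + total / n
```

When paired clips have well-aligned features, the positive score dominates the batch and the bound approaches log(n); maximising this quantity therefore pushes features of clips from the same video together, which is the behaviour the Global-MI term is designed to induce.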

Updated: 2020-12-01