Pyramid regional graph representation learning for content-based video retrieval
Information Processing & Management (IF 8.6). Pub Date: 2021-01-11. DOI: 10.1016/j.ipm.2020.102488
Guoping Zhao, Mingyu Zhang, Yaxian Li, Jiajun Liu, Bingqing Zhang, Ji-Rong Wen

Conventionally, video retrieval methods aggregate the visual feature representations from every frame into the feature of the video, treating each frame as an isolated, static image. Such methods lack the power to model intra-frame and inter-frame relationships among local regions, and are often vulnerable to the visual redundancy and noise caused by various types of video transformation and editing, such as adding image patches or banners. From the perspective of video retrieval, a video’s key information is more often than not conveyed by geometrically centered, dynamic visual content, while static areas tend to reside in regions farther from the center and often exhibit heavy temporal visual redundancy. This phenomenon is hardly investigated by conventional retrieval methods.

In this article, we propose an unsupervised video retrieval method that simultaneously models intra-frame and inter-frame contextual information for video representation, using a graph topology constructed on top of pyramid regional feature maps. By decomposing each frame into a pyramid regional sub-graph and transforming a video into a regional graph, we use graph convolutional networks to extract features that incorporate information from multiple types of context. Our method is unsupervised and uses only the frame features extracted by a pre-trained network. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art video retrieval methods.
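
The abstract does not give implementation details, so the following PyTorch sketch is purely illustrative: the 1x1 + 2x2 pyramid, the root-to-quadrant intra-frame edges, the same-region inter-frame edges, and all function names (pyramid_nodes, build_pyramid_graph, gcn_layer) are our own assumptions, not the authors' released design. It shows one plausible way pre-trained frame feature maps could become a pyramid regional graph fed to a graph-convolution step.

    # Hypothetical sketch of a pyramid regional graph + GCN step (not the paper's code).
    import torch
    import torch.nn.functional as F

    def pyramid_nodes(fmap):
        """Pool a (C, H, W) frame feature map into pyramid region vectors:
        one global 1x1 region followed by four 2x2 quadrant regions."""
        g = F.adaptive_avg_pool2d(fmap, 1).flatten(1)   # (C, 1): global node
        q = F.adaptive_avg_pool2d(fmap, 2).flatten(1)   # (C, 4): quadrant nodes
        return torch.cat([g, q], dim=1).t()             # (5, C) node features

    def build_pyramid_graph(frame_maps):
        """Stack per-frame pyramid nodes; connect root->quadrants within a
        frame and matching regions across consecutive frames."""
        T = len(frame_maps)
        X = torch.cat([pyramid_nodes(f) for f in frame_maps], dim=0)  # (5T, C)
        A = torch.eye(5 * T)                            # self-loops
        for t in range(T):
            base = 5 * t
            for q in range(1, 5):                       # intra-frame edges
                A[base, base + q] = A[base + q, base] = 1.0
            if t + 1 < T:                               # inter-frame edges
                for r in range(5):
                    A[base + r, base + 5 + r] = 1.0
                    A[base + 5 + r, base + r] = 1.0
        return X, A

    def gcn_layer(X, A, W):
        """One graph convolution: symmetric normalization, then projection."""
        d_inv_sqrt = torch.diag(A.sum(1).pow(-0.5))
        return F.relu(d_inv_sqrt @ A @ d_inv_sqrt @ X @ W)

    # Toy usage: 8 frames of 512x7x7 features from a pre-trained backbone.
    frames = [torch.randn(512, 7, 7) for _ in range(8)]
    X, A = build_pyramid_graph(frames)
    H = gcn_layer(X, A, torch.randn(512, 128))          # (40, 128) node embeddings
    video_descriptor = H.mean(0)                        # pooled video representation

Retrieval would then compare video_descriptor vectors by, e.g., cosine similarity; the paper's actual graph construction and aggregation may differ from this sketch.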


