Video Captioning with Object-Aware Spatio-Temporal Correlation and Aggregation.
IEEE Transactions on Image Processing (IF 10.6), Pub Date: 2020-04-27, DOI: 10.1109/tip.2020.2988435
Junchao Zhang, Yuxin Peng

Video captioning is a significant and challenging task in computer vision and natural language processing, aiming to automatically describe video content with natural language sentences. Comprehensive understanding of the video is the key to accurate video captioning, which requires not only capturing the global content and salient objects in the video, but also understanding the spatio-temporal relations among objects, including their temporal trajectories and spatial relationships. It is therefore important for video captioning to capture objects' relationships both within and across frames. In this paper, we propose an object-aware spatio-temporal graph (OSTG) approach for video captioning. It constructs spatio-temporal graphs to depict objects and their relations, where the temporal graphs represent objects' inter-frame dynamics and the spatial graphs represent objects' intra-frame interactions. The main novelties and advantages are: (1) Bidirectional temporal alignment: a bidirectional temporal graph is constructed both along and against the temporal order to align objects across different frames in both directions, which provides complementary clues for capturing the inter-frame temporal trajectory of each salient object. (2) Graph-based spatial relation learning: a spatial relation graph is constructed over the objects in each frame according to their relative spatial locations and semantic correlations, and is exploited to learn relation features that encode intra-frame relationships among salient objects. (3) Object-aware feature aggregation: trainable VLAD (vector of locally aggregated descriptors) models are deployed to aggregate objects' local features in an object-aware manner, learning discriminative aggregated representations for better video captioning. A hierarchical attention mechanism is also developed to distinguish the contributions of different object instances. Experiments on two widely used datasets, MSR-VTT and MSVD, demonstrate that our proposed approach achieves state-of-the-art performance in terms of the BLEU@4, METEOR, and CIDEr metrics.
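The object-aware feature aggregation in (3) builds on trainable VLAD pooling: each object's local feature is softly assigned to a set of learnable cluster centers, the residuals to those centers are accumulated, and the normalized result serves as the aggregated representation. The following is a minimal PyTorch sketch of such a layer, not the paper's actual implementation; the class name TrainableVLAD and the hyper-parameters (feature_dim=512, num_clusters=32) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrainableVLAD(nn.Module):
    """NetVLAD-style trainable aggregation of local object features (illustrative sketch)."""

    def __init__(self, feature_dim: int = 512, num_clusters: int = 32):
        super().__init__()
        # Learnable cluster centers and soft-assignment projection.
        self.centers = nn.Parameter(torch.randn(num_clusters, feature_dim) * 0.01)
        self.assign = nn.Linear(feature_dim, num_clusters)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_objects, feature_dim) -- local features of detected objects.
        soft_assign = F.softmax(self.assign(x), dim=-1)             # (B, N, K)
        residuals = x.unsqueeze(2) - self.centers                   # (B, N, K, D)
        vlad = (soft_assign.unsqueeze(-1) * residuals).sum(dim=1)   # (B, K, D)
        vlad = F.normalize(vlad, p=2, dim=-1)                       # intra-cluster normalization
        vlad = F.normalize(vlad.flatten(1), p=2, dim=-1)            # (B, K*D) global normalization
        return vlad


if __name__ == "__main__":
    # Example: aggregate features of 10 object proposals for a batch of 2 videos/frames.
    feats = torch.randn(2, 10, 512)
    agg = TrainableVLAD(feature_dim=512, num_clusters=32)
    print(agg(feats).shape)  # torch.Size([2, 16384])
```

Because the assignment weights and cluster centers are learned end-to-end, the aggregated descriptor can adapt to the captioning objective rather than relying on fixed clustering, which is the general motivation behind deploying trainable VLAD models here.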
