Multi-feature fusion refine network for video captioning
Journal of Experimental & Theoretical Artificial Intelligence (IF 2.2), Pub Date: 2021-02-23, DOI: 10.1080/0952813x.2021.1883745
Guan-Hong Wang, Ji-Xiang Du, Hong-Bo Zhang

ABSTRACT

Describing video content in natural language is an important part of video understanding. It requires not only understanding the spatial information in a video but also capturing its motion information. Moreover, video captioning is a cross-modal problem between vision and language. Traditional video captioning methods follow an encoder-decoder framework that translates a video into a sentence, but the semantic alignment from sentence back to video is ignored. Hence, finding a discriminative visual representation and narrowing the semantic gap between video and text strongly influence the accuracy of the generated sentences. In this paper, we propose an approach based on a multi-feature fusion refine network (MFRN), which not only captures spatial and motion information by exploiting multi-feature fusion, but also achieves better semantic alignment across modalities by designing a refiner that models the sentence-to-video stream. The main novelties and advantages of our method are: (1) Multi-feature fusion: two-dimensional and three-dimensional convolutional neural networks, pre-trained on ImageNet and Kinetics respectively, are used to extract spatial and motion information, which are then fused to obtain a better visual representation. (2) Semantic alignment refiner: the refiner is designed to constrain the decoder and reproduce the video features, narrowing the semantic gap between the two modalities. Experiments on two widely used datasets demonstrate that our approach achieves state-of-the-art performance in terms of BLEU@4, METEOR, ROUGE, and CIDEr metrics.
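The abstract describes the two components only at a high level. As an illustration, the minimal PyTorch sketch below shows one plausible reading of each idea: fusing frame-level 2D-CNN appearance features with 3D-CNN motion features, and a refiner that reconstructs a global video feature from the decoder's hidden states so that a reconstruction loss encourages sentence-to-video alignment. All module names, feature dimensions, and the MSE objective are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFeatureFusion(nn.Module):
    """Fuses 2D-CNN appearance features with 3D-CNN motion features.

    Dimensions are illustrative: e.g. a 2D CNN pre-trained on ImageNet
    (d2 = 2048) and a 3D CNN pre-trained on Kinetics (d3 = 1024).
    """
    def __init__(self, d2=2048, d3=1024, d_model=512):
        super().__init__()
        self.proj = nn.Linear(d2 + d3, d_model)

    def forward(self, feat2d, feat3d):
        # feat2d: (batch, frames, d2), feat3d: (batch, frames, d3)
        fused = torch.cat([feat2d, feat3d], dim=-1)
        return torch.tanh(self.proj(fused))   # (batch, frames, d_model)

class Refiner(nn.Module):
    """Hypothetical semantic-alignment refiner: rebuilds a global video
    feature from decoder hidden states, so a reconstruction loss pulls
    the generated sentence back toward the video content."""
    def __init__(self, d_hidden=512, d_model=512):
        super().__init__()
        self.rnn = nn.GRU(d_hidden, d_model, batch_first=True)

    def forward(self, decoder_states, video_feats):
        # decoder_states: (batch, words, d_hidden)
        # video_feats:    (batch, frames, d_model)
        _, h = self.rnn(decoder_states)       # h: (1, batch, d_model)
        rebuilt = h.squeeze(0)                # (batch, d_model)
        target = video_feats.mean(dim=1)      # mean-pooled video feature
        return F.mse_loss(rebuilt, target)

# Toy usage: 2 videos, 26 sampled frames, a 12-word caption.
fusion, refiner = MultiFeatureFusion(), Refiner()
v = fusion(torch.randn(2, 26, 2048), torch.randn(2, 26, 1024))
rec_loss = refiner(torch.randn(2, 12, 512), v)
```

The reconstruction loss would be added to the usual captioning cross-entropy during training, which is one common way such a refiner constrains the decoder.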



Updated: 2021-02-23