Enhancing the Alignment between Target Words and Corresponding Frames for Video Captioning, Pattern Recognition


Enhancing the Alignment between Target Words and Corresponding Frames for Video Captioning
Pattern Recognition (IF 7.5), Pub Date: 2021-03-01, DOI: 10.1016/j.patcog.2020.107702
Yunbin Tu, Chang Zhou, Junjun Guo, Shengxiang Gao, Zhengtao Yu

Abstract: Video captioning aims to translate a sequence of video frames into a sequence of words using the encoder-decoder framework, so aligning these two sequences is critical. Most existing methods use a soft-attention (temporal attention) mechanism to align target words with their corresponding frames, where the relevance between a word and a frame depends only on the previously generated words (i.e., the language context). However, there is an inherent gap between vision and language, and most words in a caption are non-visual (e.g., “a”, “is”, and “in”). Guided by the language context alone, existing temporal attention-based methods therefore cannot align target words with the corresponding frames exactly. To address this problem, we first introduce visual tags pre-detected from the video to bridge the gap between vision and language, because visual tags belong to the textual modality yet also convey visual information. We then present a Textual-Temporal Attention Model (TTA) to align the target words with the corresponding frames exactly. Experimental results show that the proposed method outperforms state-of-the-art methods on two well-known datasets, MSVD and MSR-VTT.
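For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of generic additive soft (temporal) attention, where the weight on each frame depends on a query derived from the language context. It is not the authors' TTA model, which the abstract does not specify in detail. The function name `temporal_attention`, the parameters `W_v`, `W_q`, and `w`, and all dimensions are assumptions chosen for illustration.

```python
import math
import random


def temporal_attention(frames, query, W_v, W_q, w):
    """Generic additive soft attention over video frames (illustrative only).

    frames: list of T frame feature vectors, each of length d_v
    query:  language-context vector of length d_q (e.g. a decoder state)
    W_v:    d_v x d_a projection for frames (hypothetical parameter)
    W_q:    d_q x d_a projection for the query (hypothetical parameter)
    w:      scoring vector of length d_a (hypothetical parameter)
    Returns (weights, context): T attention weights and a d_v context vector.
    """
    d_a = len(w)
    # Project the query once; it is shared across all frames.
    q_proj = [sum(query[k] * W_q[k][j] for k in range(len(query)))
              for j in range(d_a)]

    # One relevance score per frame: w^T tanh(W_v^T v_t + W_q^T q).
    scores = []
    for frame in frames:
        f_proj = [sum(frame[k] * W_v[k][j] for k in range(len(frame)))
                  for j in range(d_a)]
        scores.append(sum(w[j] * math.tanh(f_proj[j] + q_proj[j])
                          for j in range(d_a)))

    # Numerically stable softmax over the temporal axis.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the frame features.
    d_v = len(frames[0])
    context = [sum(weights[t] * frames[t][k] for t in range(len(frames)))
               for k in range(d_v)]
    return weights, context


# Toy usage with random features; all sizes are arbitrary.
random.seed(0)
T, d_v, d_q, d_a = 6, 8, 5, 4
frames = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(T)]
query = [random.gauss(0, 1) for _ in range(d_q)]
W_v = [[random.gauss(0, 1) for _ in range(d_a)] for _ in range(d_v)]
W_q = [[random.gauss(0, 1) for _ in range(d_a)] for _ in range(d_q)]
w = [random.gauss(0, 1) for _ in range(d_a)]
weights, context = temporal_attention(frames, query, W_v, W_q, w)
```

In this sketch the query carries only language context, which is the limitation the abstract points out: when the next word is non-visual, the query gives little guidance on which frames matter.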


Updated: 2021-03-01
