Hierarchical LSTMs with Adaptive Attention for Visual Captioning
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8). Pub Date: 2019-01-22. DOI: 10.1109/tpami.2019.2894139
Lianli Gao , Xiangpeng Li , Jingkuan Song , Heng Tao Shen

Recent progress has been made in using attention-based encoder-decoder frameworks for image and video captioning. Most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the", "a"). However, these non-visual words can be easily predicted by a natural language model without considering visual signals or attention, and imposing the attention mechanism on non-visual words can mislead the decoder and decrease the overall captioning performance. Furthermore, a hierarchy of LSTMs enables a richer representation of visual data, capturing information at different scales. Considering these issues, we propose a hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning. Specifically, the proposed framework utilizes spatial or temporal attention to select specific regions or frames for predicting the related words, while adaptive attention decides whether to rely on visual information or on the language context. In addition, hierarchical LSTMs are designed to simultaneously consider both low-level visual information and high-level language context to support caption generation. We design hLSTMat as a general framework and first instantiate it for the task of video captioning. We then refine it and further instantiate it for the image captioning task. To demonstrate the effectiveness of the proposed framework, we evaluate our method on both video and image captioning tasks. Experimental results show that our approach achieves state-of-the-art performance on most evaluation metrics for both tasks, and an ablation study analyzes the effect of the important components.
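The core idea of the adaptive attention described above can be illustrated with a minimal PyTorch sketch: a "visual sentinel" slot competes with the attended regions or frames, and the resulting gate (beta) determines, per generated word, how much the decoder relies on language context versus visual features. This is an illustrative approximation under assumed names (AdaptiveAttention, visual_feats, sentinel), not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, att_dim):
        super().__init__()
        self.feat_embed = nn.Linear(feat_dim, hidden_dim)  # map visual features to hidden size
        self.slot_att = nn.Linear(hidden_dim, att_dim)      # project visual slots + sentinel
        self.hidden_att = nn.Linear(hidden_dim, att_dim)    # project decoder hidden state
        self.score = nn.Linear(att_dim, 1)                  # attention energy per slot

    def forward(self, visual_feats, hidden, sentinel):
        # visual_feats: (B, N, feat_dim) image regions or video frames
        # hidden:       (B, hidden_dim) hidden state of the top (language) LSTM
        # sentinel:     (B, hidden_dim) language-context "visual sentinel"
        v = self.feat_embed(visual_feats)                       # (B, N, hidden_dim)
        slots = torch.cat([v, sentinel.unsqueeze(1)], dim=1)    # (B, N+1, hidden_dim)
        e = self.score(torch.tanh(self.slot_att(slots) +
                                  self.hidden_att(hidden).unsqueeze(1))).squeeze(-1)  # (B, N+1)
        alpha = F.softmax(e, dim=-1)                            # attention over slots + sentinel
        beta = alpha[:, -1:]                                    # gate: weight on the sentinel slot
        context = (alpha.unsqueeze(-1) * slots).sum(dim=1)      # adaptive context vector
        # beta near 1: the word is predicted from language context (e.g., "the", "a");
        # beta near 0: the word is grounded in the attended visual regions/frames.
        return context, beta

In a full captioning model, this module would sit between a bottom LSTM (producing the sentinel from low-level visual input) and a top LSTM (producing the hidden state and the word distribution), mirroring the hierarchical design sketched in the abstract.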

Updated: 2024-08-22