Recall What You See Continually Using GridLSTM in Image Captioning
IEEE Transactions on Multimedia ( IF 8.4 ) Pub Date : 2020-03-01 , DOI: 10.1109/tmm.2019.2931815
Lingxiang Wu , Min Xu , Jinqiao Wang , Stuart Perry

The goal of image captioning is to automatically describe an image with a sentence; the task has attracted attention from both the computer vision and natural language processing communities. The existing encoder–decoder model and its variants, the most popular models for image captioning, use the image features in one of three ways: first, they inject the encoded image features into the decoder only once at the initial step, which does not allow the rich image content to be explored sufficiently while the text caption is generated word by word; second, they concatenate the encoded image features with the text as extra inputs at every step, which introduces unnecessary noise; third, they use an attention mechanism, which increases the computational complexity because extra neural nets are introduced to identify the attention regions. Different from the existing methods, in this paper we propose a novel network, the Recall Network, for generating captions that are consistent with the images. The Recall Network selectively incorporates the visual features by using a GridLSTM and is thus able to recall the image content while generating each word. By importing the visual information as latent memory along the depth-dimension LSTM, the decoder admits the visual features dynamically through the inherent LSTM structure without adding any extra neural nets or parameters. The Recall Network efficiently prevents the decoder from deviating from the original image content. To verify the effectiveness of our model, we conducted exhaustive experiments on full and dense image captioning. The experimental results clearly demonstrate that our Recall Network outperforms the conventional encoder–decoder model by a large margin and performs comparably to the state-of-the-art methods.
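The depth-dimension recall described above can be sketched as a toy two-dimensional grid cell: one LSTM runs along depth and carries the visual memory, another runs along time and generates the caption, and the two exchange hidden states at every step. This is a minimal NumPy illustration of the mechanism, not the authors' implementation; all names (`grid_step`, `W_depth`, `W_time`) and the tiny dimensions are our own assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h, c, W):
    """One standard LSTM step; W stacks the four gate weights over [x; h]."""
    z = W @ np.concatenate([x, h])
    d = h.size
    i = sigmoid(z[0:d])        # input gate
    f = sigmoid(z[d:2*d])      # forget gate
    o = sigmoid(z[2*d:3*d])    # output gate
    g = np.tanh(z[3*d:4*d])    # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def grid_step(word_vec, h_time, c_time, h_depth, c_depth, W_time, W_depth):
    """One grid step: the depth LSTM re-reads ("recalls") the visual memory,
    then the time LSTM consumes the word together with the recalled state."""
    # depth dimension: visual memory updated given the current language state
    h_depth, c_depth = lstm_cell(h_time, h_depth, c_depth, W_depth)
    # time dimension: language state advances, fed by the word and the recall
    h_time, c_time = lstm_cell(np.concatenate([word_vec, h_depth]),
                               h_time, c_time, W_time)
    return h_time, c_time, h_depth, c_depth

rng = np.random.default_rng(0)
d = 8                                    # toy hidden size
image_feat = rng.standard_normal(d)      # stand-in for CNN image features
W_depth = rng.standard_normal((4*d, 2*d)) * 0.1  # input h_time (d) + state h_depth (d)
W_time  = rng.standard_normal((4*d, 3*d)) * 0.1  # input [word; h_depth] (2d) + h_time (d)

# the visual features seed the depth-dimension memory, not the word inputs
h_depth, c_depth = np.tanh(image_feat), image_feat.copy()
h_time,  c_time  = np.zeros(d), np.zeros(d)
for _ in range(5):                       # five decoding steps with dummy word vectors
    word_vec = rng.standard_normal(d)
    h_time, c_time, h_depth, c_depth = grid_step(
        word_vec, h_time, c_time, h_depth, c_depth, W_time, W_depth)
print(h_time.shape, h_depth.shape)
```

Because the image features enter only as the depth-dimension state, no extra attention network or parameters are added: the gates of the depth LSTM alone decide how much visual content is recalled at each word.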

Updated: 2020-03-01