Integrating Scene Semantic Knowledge into Image Captioning
ACM Transactions on Multimedia Computing, Communications, and Applications (IF 5.1) Pub Date: 2021-05-12, DOI: 10.1145/3439734
Haiyang Wei 1, Zhixin Li 1, Feicheng Huang 1, Canlong Zhang 1, Huifang Ma 2, Zhongzhi Shi 3

Most existing image captioning methods use only the visual information of the image to guide caption generation and lack the guidance of effective scene semantic information; moreover, current visual attention mechanisms cannot adjust their focus intensity on the image. In this article, we first propose an improved visual attention model. At each timestep, we compute a focus intensity coefficient for the attention mechanism from the model's context information, then use this coefficient to automatically adjust the focus intensity of the attention mechanism and extract more accurate visual information. In addition, we represent the scene semantic knowledge of the image with topic words related to the image scene and add them to the language model. We use the attention mechanism to determine the visual information and scene semantic information the model attends to at each timestep and combine them, enabling the model to generate more accurate and scene-specific captions. Finally, we evaluate our model on the Microsoft COCO (MSCOCO) and Flickr30k benchmark datasets. The experimental results show that our approach generates more accurate captions and outperforms many recent advanced models on various evaluation metrics.
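The focus-intensity idea can be read as a temperature-like coefficient, predicted from the decoder context, that scales the attention logits before the softmax: a larger coefficient sharpens the attention distribution, a smaller one flattens it. The PyTorch sketch below is only an illustration of that mechanism as described in the abstract, not the authors' released code; all layer names, dimensions, and the softplus parameterization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocusIntensityAttention(nn.Module):
    """Soft attention whose softmax sharpness is modulated by a
    focus-intensity coefficient predicted from the decoder context.
    A minimal sketch of the idea in the abstract; names and sizes
    are illustrative assumptions, not the paper's implementation."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        # Predicts a positive focus-intensity coefficient from the
        # current decoder hidden state (the "context information").
        self.intensity = nn.Linear(hidden_dim, 1)

    def forward(self, feats: torch.Tensor, hidden: torch.Tensor):
        # feats: (B, R, feat_dim) region features; hidden: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                   # (B, R) logits
        lam = F.softplus(self.intensity(hidden)) + 1e-6  # (B, 1), > 0
        alpha = F.softmax(lam * e, dim=1)  # larger lam -> sharper focus
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, feat_dim)
        return context, alpha
```

A second instance of the same module could attend over topic-word embeddings to produce a scene-semantic context vector, with the two contexts combined (for example, by a learned gate) before being fed to the language model; this pairing is likewise an assumption inferred from the abstract, not a confirmed detail of the paper.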
