Boost image captioning with knowledge reasoning
Machine Learning (IF 7.5) Pub Date: 2020-10-27, DOI: 10.1007/s10994-020-05919-y
Feicheng Huang, Zhixin Li, Haiyang Wei, Canlong Zhang, Huifang Ma

Automatically generating a human-like description for a given image is a promising research direction in artificial intelligence that has attracted a great deal of attention recently. Most existing attention methods explore the mapping relationships between words in a sentence and regions in an image; such an unpredictable matching manner sometimes causes inharmonious alignments that can reduce the quality of the generated captions. In this paper, we aim to reason about more accurate and meaningful captions. We first propose word attention to improve the correctness of visual attention when generating sequential descriptions word by word. This word attention emphasizes word importance when focusing on different regions of the input image, and makes full use of internal annotation knowledge to assist the computation of visual attention. Then, to reveal implicit intentions that machines cannot express straightforwardly, we introduce a new strategy that injects external knowledge extracted from a knowledge graph into the encoder-decoder framework to facilitate meaningful captioning. Finally, we validate our model on two freely available captioning benchmarks: the Microsoft COCO dataset and the Flickr30k dataset. The results demonstrate that our approach achieves state-of-the-art performance and outperforms many existing approaches.
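The abstract gives no implementation details, but the two mechanisms it names can be sketched. Below is a minimal PyTorch sketch of word-attention-guided visual attention in a soft-attention captioning decoder; the module names, dimensions, and the additive fusion of the word-attention context into the visual attention scores are illustrative assumptions, not the paper's published equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordGuidedVisualAttention(nn.Module):
    """Visual attention over image regions, conditioned on word attention."""

    def __init__(self, region_dim, word_dim, hidden_dim, attn_dim):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, word_dim)  # query for word attention
        self.v_proj = nn.Linear(region_dim, attn_dim)  # image-region features
        self.h_proj = nn.Linear(hidden_dim, attn_dim)  # decoder hidden state
        self.w_proj = nn.Linear(word_dim, attn_dim)    # word-attention context
        self.score = nn.Linear(attn_dim, 1)

    def word_attention(self, word_embs, hidden):
        # Weigh the words generated so far by their relevance to the current
        # decoder state; word_embs: (B, T, word_dim), hidden: (B, hidden_dim).
        scores = torch.bmm(word_embs, self.q_proj(hidden).unsqueeze(2)).squeeze(2)
        alpha = F.softmax(scores, dim=1)                            # (B, T)
        return torch.bmm(alpha.unsqueeze(1), word_embs).squeeze(1)  # (B, word_dim)

    def forward(self, regions, word_embs, hidden):
        # regions: (B, K, region_dim) encoder features for K image regions.
        word_ctx = self.word_attention(word_embs, hidden)
        # Additive attention over regions, conditioned on both the decoder
        # state and the word-attention context.
        e = self.score(torch.tanh(self.v_proj(regions)
                                  + self.h_proj(hidden).unsqueeze(1)
                                  + self.w_proj(word_ctx).unsqueeze(1)))
        beta = F.softmax(e.squeeze(2), dim=1)                       # (B, K)
        visual_ctx = torch.bmm(beta.unsqueeze(1), regions).squeeze(1)
        return visual_ctx, beta
```

The knowledge-injection step can be sketched similarly: the decoder attends over embeddings of concepts related to the detected objects. The lookup table `related_concepts` below is a hypothetical stand-in for terms retrieved offline from a knowledge graph such as ConceptNet, and the attention-based fusion is an assumption rather than the paper's exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical offline retrieval: detected object -> related concepts.
related_concepts = {
    "dog": ["pet", "leash", "park"],
    "surfboard": ["wave", "ocean", "beach"],
}

class KnowledgeFusion(nn.Module):
    """Summarize retrieved concepts into a context vector for the decoder."""

    def __init__(self, vocab, emb_dim, hidden_dim):
        super().__init__()
        self.stoi = {w: i for i, w in enumerate(vocab)}
        self.emb = nn.Embedding(len(vocab), emb_dim)
        self.q_proj = nn.Linear(hidden_dim, emb_dim)

    def forward(self, detected, hidden):
        # Collect the concepts related to the detected objects.
        concepts = [c for obj in detected for c in related_concepts.get(obj, [])]
        k = self.emb(torch.tensor([self.stoi[c] for c in concepts]))  # (N, E)
        # Attend over concept embeddings with the decoder state as query;
        # the result can be concatenated with the decoder input each step.
        alpha = F.softmax(k @ self.q_proj(hidden), dim=0)             # (N,)
        return alpha @ k                                              # (E,)

# Usage: a knowledge context vector for one caption step after detecting "dog".
fusion = KnowledgeFusion(["pet", "leash", "park", "wave", "ocean", "beach"], 64, 512)
knowledge_ctx = fusion(["dog"], torch.randn(512))
```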

Updated: 2020-10-27