Image Captioning and Visual Question Answering Based on Attributes and External Knowledge,IEEE Transactions on Pattern Analysis and Machine Intelligence

当前位置： X-MOL 学术 › IEEE Trans. Pattern Anal. Mach. Intell. › 论文详情

Our official English website, www.x-mol.net, welcomes your feedback! (Note: you will need to create a separate account there.)

Image Captioning and Visual Question Answering Based on Attributes and External Knowledge
IEEE Transactions on Pattern Analysis and Machine Intelligence ( IF 23.6 ) Pub Date : 2017-05-26 , DOI: 10.1109/tpami.2017.2708709
Qi Wu , Chunhua Shen , Peng Wang , Anthony Dick , Anton van den Hengel

Much of the recent progress in Vision-to-Language problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we first propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering. We further show that the same mechanism can be used to incorporate external knowledge, which is critically important for answering high level visual questions. Specifically, we design a visual question answering model that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. It particularly allows questions to be asked where the image alone does not contain the information required to select the appropriate answer. Our final model achieves the best reported results for both image captioning and visual question answering on several of the major benchmark datasets.

中文翻译：

基于属性和外部知识的图像字幕和视觉问答

通过将卷积神经网络（CNN）和递归神经网络（RNN）结合，可以实现视觉到语言问题的最新进展。这种方法没有明确表示高级语义概念，而是试图直接从图像特征发展为文本。在本文中，我们首先提出一种将高级概念整合到成功的CNN-RNN方法中的方法，并表明该方法在图像字幕和视觉问题解答方面都实现了最新技术的显着改进。我们进一步表明，可以使用相同的机制来整合外部知识，这对于回答高级视觉问题至关重要。具体来说，我们设计了一个视觉问题解答模型，该模型将图像内容的内部表示形式与从一般知识库中提取的信息相结合，以回答各种基于图像的问题。它特别允许在仅图像不包含选择适当答案所需信息的地方提出问题。我们的最终模型在几个主要基准数据集上的图像字幕和视觉问题回答方面均取得了最佳的报告结果。

更新日期：2018-05-05

点击分享查看原文

点击收藏

公开下载

阅读更多本刊最新论文本刊介绍/投稿指南

全部期刊列表>>