Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
International Journal of Computer Vision (IF 19.5), Pub Date: 2018-09-11, DOI: 10.1007/s11263-018-1116-0
Yash Goyal, Tejas Khot, Aishwarya Agrawal, Douglas Summers-Stay, Dhruv Batra, Devi Parikh

The problem of visual question answering (VQA) is of significant importance both as a challenging research question and for the rich set of applications it enables. In this context, however, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in VQA models that ignore visual information and, consequently, in an inflated sense of their capabilities. We propose to counter these language priors for the task of VQA and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset (Antol et al., in: ICCV, 2015) by collecting complementary images such that every question in our balanced dataset is associated not with a single image, but with a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at http://visualqa.org/ as part of the 2nd iteration of the VQA Dataset and Challenge (VQA v2.0). We further benchmark a number of state-of-the-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what has so far been a qualitative sense among practitioners. We also present insights from an analysis of the participant entries in the VQA Challenge 2017, which we organized on the proposed VQA v2.0 dataset; the results of the challenge were announced at the 2nd VQA Challenge Workshop at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017. Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model which, in addition to providing an answer to the given (image, question) pair, also provides a counter-example based explanation: it identifies an image that is similar to the original image but that it believes has a different answer to the same question. This can help build users' trust in machines.
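The counter-example based explanation described in the abstract admits a compact illustration. The following is a minimal sketch, assuming a generic `predict(image, question)` interface that returns an answer with a confidence score, and a precomputed list of visually similar candidate images (e.g. nearest neighbours of the original image); all names and interfaces here are hypothetical placeholders, not the authors' implementation.

```python
# Sketch of counter-example based explanation: answer the question, then pick,
# from images similar to the original, one on which the model believes the
# same question has a different answer. All interfaces are hypothetical.

from typing import Callable, List, Tuple


def explain_with_counter_example(
    image,                      # original image (e.g. a feature array)
    question: str,
    candidates: List,           # images visually similar to `image`
    predict: Callable,          # predict(image, question) -> (answer, confidence)
) -> Tuple[str, object]:
    """Return the predicted answer and a similar image believed to yield a different answer."""
    answer, _ = predict(image, question)

    best_counter = None
    best_conf = -1.0
    for cand in candidates:
        cand_answer, cand_conf = predict(cand, question)
        # Keep the similar image on which the model most confidently
        # predicts a *different* answer to the same question.
        if cand_answer != answer and cand_conf > best_conf:
            best_counter, best_conf = cand, cand_conf
    return answer, best_counter
```

Returning the counter-example alongside the answer gives users a concrete, visual reason to trust (or question) the prediction, in the spirit of the interpretable model described above.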

Updated: 2018-09-11