SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions
arXiv - CS - Artificial Intelligence | Pub Date: 2020-01-20 | DOI: arxiv-2001.06927
Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Ribeiro, Besmira Nushi, Ece Kamar
Existing VQA datasets contain questions with varying levels of complexity.
While the majority of questions in these datasets require perception for
recognizing existence, properties, and spatial relationships of entities, a
significant portion of questions pose challenges that correspond to reasoning
tasks - tasks that can only be answered through a synthesis of perception and
knowledge about the world, logic, and/or reasoning. Analyzing performance
across this distinction allows us to notice when existing VQA models have
consistency issues; they answer the reasoning questions correctly but fail on
associated low-level perception questions. For example, in Figure 1, models
answer the complex reasoning question "Is the banana ripe enough to eat?"
correctly, but fail on the associated perception question "Are the bananas
mostly green or yellow?" indicating that the model likely answered the
reasoning question correctly but for the wrong reason. We quantify the extent
to which this phenomenon occurs by creating a new Reasoning split of the VQA
dataset and collecting VQA-introspect, a new dataset consisting of 238K
new perception questions that serve as sub-questions corresponding to the set
of perceptual tasks needed to effectively answer the complex reasoning
questions in the Reasoning split. Our evaluation shows that state-of-the-art
VQA models have comparable performance in answering perception and reasoning
questions, but suffer from consistency problems. To address this shortcoming,
we propose an approach called Sub-Question Importance-aware Network Tuning
(SQuINT), which encourages the model to attend to the same parts of the image
when answering the reasoning question and the perception sub-question. We show
that SQuINT improves model consistency by ~5% and marginally improves
performance on the Reasoning questions in VQA, while also producing better
attention maps.
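The consistency evaluation described above (a model answers a reasoning question correctly but fails its associated perception sub-questions) can be sketched as a simple metric. The record format below is a hypothetical illustration, not the dataset's actual schema:

```python
def consistency(records):
    """Fraction of correctly answered reasoning questions whose
    associated perception sub-questions are ALL also answered correctly.

    `records` is a hypothetical list of dicts of the form
    {"reasoning_correct": bool, "sub_correct": [bool, ...]}.
    """
    # Only reasoning questions the model got right are eligible.
    correct = [r for r in records if r["reasoning_correct"]]
    if not correct:
        return 0.0
    # A record is consistent if every sub-question was also answered correctly.
    consistent = sum(all(r["sub_correct"]) for r in correct)
    return consistent / len(correct)
```

On this definition, the banana example from the abstract (reasoning question right, perception sub-question wrong) counts against consistency even though both answers are scored independently as correct/incorrect.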
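The abstract does not spell out SQuINT's exact training objective. As a hedged illustration of the idea of attending to the same image regions for a reasoning question and its sub-question, one could add a penalty on the distance between the two attention maps; the function name, `alpha` weight, and squared-error form below are assumptions for the sketch, not the authors' implementation:

```python
import numpy as np

def attention_alignment_loss(reasoning_attn, perception_attn, answer_loss, alpha=0.5):
    """Hypothetical sketch: augment a standard VQA answer loss with a
    penalty that is small when the attention map for the reasoning
    question matches the map for its perception sub-question."""
    # Normalize each attention map to a distribution over image regions.
    r = reasoning_attn / reasoning_attn.sum()
    p = perception_attn / perception_attn.sum()
    # Squared-error distance between the two attention distributions.
    attn_penalty = float(np.sum((r - p) ** 2))
    return answer_loss + alpha * attn_penalty
```

When the two maps coincide the penalty vanishes and the objective reduces to the answer loss alone; diverging maps are increasingly penalized, which is the behavior the SQuINT description suggests.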
Updated: 2020-06-17