SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions
arXiv - CS - Artificial Intelligence | Pub Date: 2020-01-20 | DOI: arxiv-2001.06927
Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Ribeiro, Besmira Nushi, Ece Kamar
Existing VQA datasets contain questions with varying levels of complexity.
While the majority of questions in these datasets require perception for
recognizing existence, properties, and spatial relationships of entities, a
significant portion of questions pose challenges that correspond to reasoning
tasks - tasks that can only be answered through a synthesis of perception and
knowledge about the world, logic, and/or reasoning. Analyzing performance
across this distinction allows us to notice when existing VQA models have
consistency issues; they answer the reasoning questions correctly but fail on
associated low-level perception questions. For example, in Figure 1, models
answer the complex reasoning question "Is the banana ripe enough to eat?"
correctly, but fail on the associated perception question "Are the bananas
mostly green or yellow?" indicating that the model likely answered the
reasoning question correctly but for the wrong reason. We quantify the extent
to which this phenomenon occurs by creating a new Reasoning split of the VQA
dataset and collecting VQA-introspect, a new dataset consisting of 238K
new perception questions that serve as sub-questions corresponding to the set
of perceptual tasks needed to effectively answer the complex reasoning
questions in the Reasoning split. Our evaluation shows that state-of-the-art
VQA models have comparable performance in answering perception and reasoning
questions, but suffer from consistency problems. To address this shortcoming,
we propose an approach called Sub-Question Importance-aware Network Tuning
(SQuINT), which encourages the model to attend to the same parts of the image
when answering the reasoning question and the perception sub-question. We show
that SQuINT improves model consistency by ~5% and marginally improves
performance on the Reasoning questions in VQA, while also producing better
attention maps.
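The consistency evaluation described above (a model answers a reasoning question correctly but fails its associated perception sub-questions) can be sketched as a simple metric. The record format below is a hypothetical illustration, not the dataset's actual schema:

```python
def consistency(records):
    """Fraction of correctly answered reasoning questions whose
    associated perception sub-questions are ALL also answered correctly.

    `records` is a hypothetical list of dicts of the form
    {"reasoning_correct": bool, "sub_correct": [bool, ...]}.
    """
    # Only reasoning questions the model got right are eligible.
    correct = [r for r in records if r["reasoning_correct"]]
    if not correct:
        return 0.0
    # A record is consistent if every sub-question was also answered correctly.
    consistent = sum(all(r["sub_correct"]) for r in correct)
    return consistent / len(correct)
```

On this definition, the banana example from the abstract (reasoning question right, perception sub-question wrong) counts against consistency even though both answers are scored independently as correct/incorrect.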
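The abstract does not spell out SQuINT's exact training objective. As a hedged illustration of the idea of attending to the same image regions for a reasoning question and its sub-question, one could add a penalty on the distance between the two attention maps; the function name, `alpha` weight, and squared-error form below are assumptions for the sketch, not the authors' implementation:

```python
import numpy as np

def attention_alignment_loss(reasoning_attn, perception_attn, answer_loss, alpha=0.5):
    """Hypothetical sketch: augment a standard VQA answer loss with a
    penalty that is small when the attention map for the reasoning
    question matches the map for its perception sub-question."""
    # Normalize each attention map to a distribution over image regions.
    r = reasoning_attn / reasoning_attn.sum()
    p = perception_attn / perception_attn.sum()
    # Squared-error distance between the two attention distributions.
    attn_penalty = float(np.sum((r - p) ** 2))
    return answer_loss + alpha * attn_penalty
```

When the two maps coincide the penalty vanishes and the objective reduces to the answer loss alone; diverging maps are increasingly penalized, which is the behavior the SQuINT description suggests.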
Updated: 2020-06-17