LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering
arXiv - CS - Computation and Language. Pub Date: 2020-11-21. arXiv:2011.10731
Weixin Liang, Feiyang Niu, Aishwarya Reganti, Govind Thattai, Gokhan Tur

The predominant approach to visual question answering (VQA) relies on encoding the image and question with a "black-box" neural encoder and decoding a single token, such as "yes" or "no", as the answer. Despite its strong quantitative results, this approach struggles to produce intuitive, human-readable justifications for its predictions. To address this shortcoming, we reformulate VQA as a full answer generation task, which requires the model to justify its predictions in natural language. We propose LRTA [Look, Read, Think, Answer], a transparent neural-symbolic reasoning framework for visual question answering that solves the problem step by step, as humans do, and provides a human-readable justification at each step. Specifically, LRTA learns to first convert an image into a scene graph and parse a question into multiple reasoning instructions. It then executes the reasoning instructions one at a time by traversing the scene graph with a recurrent neural-symbolic execution module. Finally, it generates a full answer to the given question with natural-language justifications. Our experiments on the GQA dataset show that LRTA outperforms the state-of-the-art model by a large margin (43.1% vs. 28.0%) on the full answer generation task. We also create a perturbed GQA test set by removing linguistic cues (attributes and relations) from the questions, to analyze whether a model is merely making smart guesses from superficial data correlations. We show that LRTA takes a step toward truly understanding the question, while the state-of-the-art model tends to learn superficial correlations from the training data.
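To make the pipeline concrete, below is a minimal, hypothetical sketch of the "Think" step: executing parsed reasoning instructions one at a time by traversing a scene graph. The instruction names ("filter", "relate"), the graph layout, and the toy example are illustrative assumptions, not the authors' code; in LRTA this step is a learned recurrent neural-symbolic module rather than the hard symbolic lookups used here.

    # Hypothetical sketch of instruction execution over a scene graph.
    # Data layout and instruction set are assumptions for illustration only.
    from typing import Dict, List, Set, Tuple

    # Toy scene graph: node -> attributes, and (subject, relation) -> objects.
    ATTRS: Dict[str, Set[str]] = {
        "cube1": {"cube", "red"},
        "ball1": {"ball", "blue"},
        "ball2": {"ball", "red"},
    }
    RELS: Dict[Tuple[str, str], List[str]] = {
        ("cube1", "left of"): ["ball1"],
        ("ball1", "left of"): ["ball2"],
    }

    def execute(instructions: List[Tuple[str, str]]) -> Set[str]:
        """Run instructions one at a time, tracking a frontier of candidate nodes."""
        frontier: Set[str] = set(ATTRS)          # start from all objects
        for op, arg in instructions:
            if op == "filter":                   # keep nodes carrying an attribute
                frontier = {n for n in frontier if arg in ATTRS[n]}
            elif op == "relate":                 # hop along a relation edge
                frontier = {o for n in frontier
                            for o in RELS.get((n, arg), [])}
        return frontier

    # "What is left of the red cube?" -> filter(cube), filter(red), relate(left of)
    print(execute([("filter", "cube"), ("filter", "red"), ("relate", "left of")]))
    # -> {'ball1'}

The frontier-of-nodes design mirrors how each instruction narrows or shifts the set of candidate objects, which is what makes every intermediate step of such a pipeline human-readable.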

Updated: 2020-11-25