LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering
arXiv - CS - Computation and Language. Pub Date: 2020-11-21. arXiv:2011.10731
Weixin Liang, Feiyang Niu, Aishwarya Reganti, Govind Thattai, Gokhan Tur

The predominant approach to visual question answering (VQA) relies on encoding the image and question with a "black-box" neural encoder and decoding a single token, such as "yes" or "no", as the answer. Despite its strong quantitative results, this approach struggles to produce intuitive, human-readable justifications for its predictions. To address this shortcoming, we reformulate VQA as a full answer generation task, which requires the model to justify its predictions in natural language. We propose LRTA [Look, Read, Think, Answer], a transparent neural-symbolic reasoning framework for visual question answering that solves the problem step by step, as humans do, and provides a human-readable justification at each step. Specifically, LRTA learns to first convert an image into a scene graph and parse a question into multiple reasoning instructions. It then executes the reasoning instructions one at a time by traversing the scene graph with a recurrent neural-symbolic execution module. Finally, it generates a full answer to the given question with natural-language justifications. Our experiments on the GQA dataset show that LRTA outperforms the state-of-the-art model by a large margin (43.1% vs. 28.0%) on the full answer generation task. We also create a perturbed GQA test set by removing linguistic cues (attributes and relations) from the questions, to analyze whether a model is merely making smart guesses from superficial data correlations. We show that LRTA takes a step toward truly understanding the question, while the state-of-the-art model tends to learn superficial correlations from the training data.
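To make the pipeline concrete, below is a minimal, hypothetical sketch of the "Think" step: executing parsed reasoning instructions one at a time by traversing a scene graph. The instruction names ("filter", "relate"), the graph layout, and the toy example are illustrative assumptions, not the authors' code; in LRTA this step is a learned recurrent neural-symbolic module rather than the hard symbolic lookups used here.

    # Hypothetical sketch of instruction execution over a scene graph.
    # Data layout and instruction set are assumptions for illustration only.
    from typing import Dict, List, Set, Tuple

    # Toy scene graph: node -> attributes, and (subject, relation) -> objects.
    ATTRS: Dict[str, Set[str]] = {
        "cube1": {"cube", "red"},
        "ball1": {"ball", "blue"},
        "ball2": {"ball", "red"},
    }
    RELS: Dict[Tuple[str, str], List[str]] = {
        ("cube1", "left of"): ["ball1"],
        ("ball1", "left of"): ["ball2"],
    }

    def execute(instructions: List[Tuple[str, str]]) -> Set[str]:
        """Run instructions one at a time, tracking a frontier of candidate nodes."""
        frontier: Set[str] = set(ATTRS)          # start from all objects
        for op, arg in instructions:
            if op == "filter":                   # keep nodes carrying an attribute
                frontier = {n for n in frontier if arg in ATTRS[n]}
            elif op == "relate":                 # hop along a relation edge
                frontier = {o for n in frontier
                            for o in RELS.get((n, arg), [])}
        return frontier

    # "What is left of the red cube?" -> filter(cube), filter(red), relate(left of)
    print(execute([("filter", "cube"), ("filter", "red"), ("relate", "left of")]))
    # -> {'ball1'}

The frontier-of-nodes design mirrors how each instruction narrows or shifts the set of candidate objects, which is what makes every intermediate step of such a pipeline human-readable.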

Updated: 2020-11-25