Visual Experience-Based Question Answering with Complex Multimodal Environments
Mathematical Problems in Engineering Pub Date : 2020-11-19 , DOI: 10.1155/2020/8567271
Incheol Kim 1

This paper proposes a novel visual experience-based question answering (VEQA) problem and a corresponding dataset for embodied intelligence research, which require an agent to perform actions, understand 3D scenes from successive partial input images, and answer natural language questions about its visual experiences in real time. Unlike conventional visual question answering (VQA), the VEQA problem assumes both partial observability and the dynamics of a complex multimodal environment. To address the VEQA problem, we propose a hybrid visual question answering system, VQAS, which integrates a deep neural network-based scene graph generation model with a rule-based knowledge reasoning system. The proposed system can generate more accurate scene graphs for dynamic environments under uncertainty. Moreover, it can answer complex questions through knowledge reasoning over rich background knowledge. Experimental results obtained with the photo-realistic 3D simulated environment AI2-THOR and the VEQA benchmark dataset demonstrate the high performance of the proposed system.
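The hybrid pipeline the abstract describes — accumulating a scene graph from successive partial observations, then answering questions by rule-based reasoning over it — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: all names (`SceneGraph`, `answer`) and the string-matching "rules" are hypothetical simplifications; VQAS uses a deep neural scene graph generator and a full knowledge reasoning system.

```python
# Toy sketch of a VEQA-style hybrid pipeline (hypothetical names and rules).
# A scene graph is fused incrementally from partial views of the environment,
# then a rule-based reasoner answers natural language questions over it.

from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # Facts as (subject, relation, object) triples, e.g. ("apple", "on", "table").
    triples: set = field(default_factory=set)

    def merge(self, new_triples):
        # Fuse triples observed in a new partial view into the running graph,
        # standing in for the paper's neural scene graph generation/fusion step.
        self.triples |= set(new_triples)


def _strip_article(phrase: str) -> str:
    # Drop a leading "the " so question phrases match graph entities.
    return phrase.removeprefix("the ").strip()


def answer(graph: SceneGraph, question: str) -> str:
    # Toy rule-based reasoner covering two question patterns.
    q = question.lower().rstrip("?")
    if q.startswith("where is "):
        target = _strip_article(q[len("where is "):])
        for s, r, o in graph.triples:
            if s == target and r == "on":
                return o
    if q.startswith("what is on "):
        target = _strip_article(q[len("what is on "):])
        for s, r, o in graph.triples:
            if r == "on" and o == target:
                return s
    return "unknown"


# Successive partial observations, as if from an agent moving through a scene.
g = SceneGraph()
g.merge([("apple", "on", "table")])
g.merge([("cup", "on", "counter")])
print(answer(g, "Where is the apple?"))   # prints "table"
print(answer(g, "What is on the counter?"))  # prints "cup"
```

Partial observability shows up here as `merge`: no single observation contains the whole graph, but the reasoner still answers correctly once the relevant triple has been observed.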

Updated: 2020-11-21