Abstract

This paper proposes a novel visual experience-based question answering (VEQA) problem and the corresponding dataset for embodied intelligence research, which requires an agent to perform actions, understand 3D scenes from successive partial input images, and answer natural language questions about its visual experiences in real time. Unlike conventional visual question answering (VQA), the VEQA problem assumes both partial observability and dynamics of a complex multimodal environment. To address this VEQA problem, we propose a hybrid visual question answering system, VQAS, integrating a deep neural network-based scene graph generation model and a rule-based knowledge reasoning system. The proposed system can generate accurate scene graphs even for dynamic environments with some uncertainty. Moreover, it can answer complex questions through knowledge reasoning with rich background knowledge. Results of experiments using a photo-realistic 3D simulated environment, AI2-THOR, and the VEQA benchmark dataset demonstrate the high performance of the proposed system.

1. Introduction

With the rapid development of deep learning technology, high-level image understanding problems are gaining increasing attention in the computer-vision community [1, 2]. Visual question answering (VQA) [3] is one of the most actively researched image understanding problems; it requires the generation of correct answers to natural language questions about input images, as illustrated in Figure 1(a). This is a complex intelligence problem that requires both image and natural language understanding abilities.

However, existing VQA problems have a few limitations. First, it is difficult to understand the three-dimensional (3D) configuration of the entire environment because only one input image is provided with no distinction between indoor and outdoor surroundings. Consequently, the scope of questions covers only a part of the environment in space or time. Furthermore, many questions do not require commonsense or background knowledge and can be sufficiently answered by the understanding of the input image and question. These VQA problems do not consider the agent’s body in the environment or interactions between the agent and environment. Therefore, unlike the real world, it is impossible to ask questions about the state of an agent when acquiring the input image and about the environmental change after the agent performs a specific action.

To overcome these limitations of the existing VQA problems, the present study proposes a visual experience-based question answering (VEQA) problem and a question answering system (VQAS) to solve the problem. The VEQA can be considered a type of an embodied VQA (EVQA) problem that has recently begun to be researched in the computer-vision field [4, 5]. The conventional EVQA problems assume that environmental changes do not exist except for changes in the agent position; however, in this study, the authors assumed that environmental changes are possible not only by the positional movement actions of the agent but also by manipulative actions such as picking up bread and opening the refrigerator. Furthermore, in the existing EVQA problems, the agent must plan the movement actions that it will perform to obtain the answer. However, the new VEQA problem is different from the EVQA problem in that the agent must perform a series of actions that have been planned in advance and must answer questions about environmental state changes that it has experienced. Moreover, in the existing EVQA problems, no model about the agent’s action is assumed, whereas in the VEQA problems the agent is assumed to have a probabilistic model about its actions in advance.

Figure 1(b) illustrates the proposed VEQA. As shown, while performing a series of actions a_1, ..., a_T, the agent observes input images of the environment in its visible range. Furthermore, the range of questions q_t given at time t is limited to the 3D configurations and the environmental states experienced by the agent through the input images until that time. In this example, immediately after performing action a_t = "Pick up Bread," the correct answer to question q_t = "What is the thing the agent has?" is "Bread." Furthermore, the VEQA problem requires separate commonsense knowledge in addition to a proper understanding of the changing environmental state. For example, in Figure 1(b), to answer question q_t = "How many fruits are on the kitchen table?", the separate commonsense knowledge that the observed object "Apple" is a type of "Fruit" is required. The environmental states for question answering are represented as a 3D scene graph, as shown in Figure 2. The 3D scene graph is a knowledge graph consisting of objects in a 3D environment, object attributes, and spatial relationships between objects. Consequently, the VEQA is a problem of generating 3D scene graphs from the agent's visual experiences and answering the given questions based on these 3D scene graphs.

Unlike the conventional VQA problem, the VEQA problem addressed in this study is designed for embodied intelligence in a photo-realistic 3D simulated environment close to the real world, which requires an agent to navigate and act, understand the entire 3D scene from successive partial input images, and answer questions about the dynamic scenes. To deal with the VEQA problem, we suggest a structured representation that expresses the visual scene understanding of the dynamic 3D environment as a series of 3D scene graphs. A 3D scene graph captures objects and their spatial relationships in a given scene of the environment. The structured information represented in a scene graph is useful for downstream tasks such as visual question answering. However, most existing scene graph generation models [6–9] generate a 2D scene graph from a single image. Therefore, they can neither model the entire 3D scene of the environment from a sequence of partial images nor represent dynamic changes of scenes caused by the agent's navigation and actions. To overcome such limitations of the conventional models, we propose a novel 3D scene graph generation model which can generate a series of 3D scene graphs from a sequence of partial input images in a dynamic environment. This model consists of a state recognition module and a state prediction module. While the former recognizes environmental states from input images based on a trained deep neural network, the latter predicts environmental changes caused by the agent's actions using rule-based action models.

To address the VEQA problem, we also propose a knowledge reasoning system to answer questions about dynamic scenes of the environment based on 3D scene graphs. Some VEQA questions require deep background knowledge, such as object hierarchies, which is beyond the shallow knowledge contained in 3D scene graphs generated on the fly from input images. However, conventional visual question answering models based on pure deep neural networks [10, 11] cannot easily utilize the structural information of 3D scene graphs. Moreover, such models also have difficulty making use of background/prior knowledge of the environment. Different from the pure deep neural network-based models, the proposed knowledge reasoning system can use a rich knowledge source to answer questions by combining the shallow knowledge in 3D scene graphs with a large amount of prebuilt deep background knowledge.

The contributions of this paper are summarized as follows:
(1) We propose a novel VEQA problem and the corresponding dataset for embodied intelligence research that requires an agent to perform actions, understand entire 3D scenes from successive partial input images, and answer natural language questions about the dynamic scenes in a photo-realistic 3D simulated environment.
(2) To address the VEQA problem, we propose a hybrid visual question answering system, VQAS, integrating a deep neural network-based scene graph generation model and a rule-based knowledge reasoning system.
(3) We propose a novel 3D scene graph generation (SGG) model which can generate a series of 3D scene graphs from successive partial input images in a dynamic environment. The proposed model overcomes the limitation of conventional scene graph generation models that build just a 2D scene graph from a single still image. The model also handles well the partial observability and dynamics of the VEQA environment.
(4) We also propose a knowledge reasoning system to answer natural language questions based on 3D scene graphs. Different from the pure deep neural network-based models, the proposed knowledge reasoning system can use a rich knowledge source to answer questions by combining the shallow knowledge in 3D scene graphs with a large amount of prebuilt deep background knowledge.
(5) The high performance of the proposed VQAS system is verified through a series of experiments using a photo-realistic 3D simulated environment, AI2-THOR, and the VEQA benchmark dataset.

2. Related Work

2.1. Scene Graph Generation

Scene graph generation involves the expression of scenes in images as knowledge in graph form [12, 13]. A scene graph generally consists of nodes representing the objects in the image and edges representing the relationships between objects [14]. This scene graph can be effectively used in problems that require high-level image understanding such as image captioning and generation [15, 16].

Existing relevant works have generated 2D scene graphs to express the objects in a single image and the relationships between them by matching them to a 2D space [6–9]. These works sequentially performed image-based object detection and relationship recognition between the objects and then expressed the results in one scene graph. In the object detection process, Faster R-CNN [17] was mainly used to recognize the positions and classes of objects in images based on a convolutional neural network. In the relationship recognition process, various features of the object regions were used to determine the relationships between the detected objects. For accurate relationship recognition, researchers used various features, such as the visual and spatial features, of each object. However, they did not examine 3D scene graphs that can express the 3D positions of objects in images and their 3D spatial relationships.

Contrary to existing scene graph studies, visual graphs from motion (VGfM) [18] generated a 3D scene graph consisting of 3D objects and their spatial relationships from multiple images. This study performed 2D object detection on the input images and calculated the 3D object regions by using the positions of objects in the images and the observer's pose. Further, the object classes and spatial relationships between the objects were derived from the features of the objects appearing in multiple images through a recurrent neural network. However, this method has the following constraints: all the objects to be recognized must be captured in each image, and every image must represent the same environmental state. Therefore, VGfM [18] cannot be applied to partially observable and dynamically changing environmental conditions such as those in VEQA. To overcome these limitations, the present study proposes a method of correctly expressing the dynamic environmental state by expanding and updating the 3D scene graph according to the input of partial images.

2.2. Visual Question Answering

Existing studies on VQA proposed various deep neural network models to solve the problem [4, 5]. Many existing studies extracted visual features by applying a convolutional neural network to the input image and extracted linguistic features by applying a recurrent neural network to the natural language question. Then, they combined these two features with an appropriate attention mechanism and trained the model end to end to generate correct answers to questions [19–24]. Some studies attempted to generate higher-quality answers by extracting high-level semantic features from input images [25–27]. Anderson et al. [25] conducted object detection in advance on input images to determine which object regions to focus on in the answer-generation stage. Teney et al. [26] used a graph-structured representation for VQA to detect objects in input images and determined the similarity between the detected objects and the words constituting the natural language question. Through this process, they tried to generate answers by using the object features with high similarity to the question.

However, these VQA models encode the scenes in the input images only as implicit knowledge embedded in the deep neural network. Such implicit knowledge obtained from images is difficult for humans to understand compared to explicit knowledge in symbolic logic or graph form. Furthermore, explaining the validity of the generated answer is difficult. Moreover, combining such knowledge with various external knowledge bases and using them for more in-depth questions that require separate commonsense knowledge is also difficult. To overcome these limitations, the present study proposes a method of generating a scene graph, which comprises explicit scene knowledge, from the images and using the graph in answering questions. This method can verify the validity of the answer through scene knowledge that can be understood by humans. Furthermore, it is capable of answering questions that require separate commonsense knowledge because it can combine scene knowledge with an external knowledge base.

2.3. Knowledge-Based VQA

Unlike the conventional VQA, Wang et al. [28] proposed a fact-based VQA (FVQA) problem that always requires separate commonsense knowledge to answer questions. To solve an FVQA problem, the image understanding result and external knowledge base must be used together to answer a question.

Wang et al. [28] tried to use only the knowledge that is directly related to the image from the knowledge defined in the external knowledge base. To that end, they obtained commonsense knowledge related to the image by searching the external knowledge base with the image understanding result produced by a deep neural network. In addition, they reasoned the answer to a question based on the obtained commonsense knowledge. However, it was difficult to answer questions that only required simple image understanding because the model could only generate answers based on commonsense knowledge. To overcome this limitation, Narasimhan and Schwing [29] proposed a VQA model which decides between scene and external knowledge to answer a question by applying a recurrent neural network to the question. However, this method is limited in that it only depends on one of the two knowledge types and cannot use a combination of the two. To obtain an answer by using both knowledge types simultaneously, Narasimhan et al. [30] combined the two knowledge types into one graph, extracted features from the graph by using a graph convolutional network, and generated an answer from the extracted features.

However, the previous studies performed only low-level image understanding to generate scene knowledge and are not appropriate for question answering problems that must consider relationships between objects, for example, "What things are in the refrigerator?" Furthermore, they only selected a simple recognition result or one of the known facts in an external knowledge base to answer a question. Therefore, the existing models can only answer questions that ask simple facts about objects and places, such as "What is the plant-eating animal shown here?", and cannot answer questions that require complex reasoning, such as "How many fruits are on the kitchen table?" To overcome this limitation, the present study generates a 3D scene graph composed of objects in the environment, their attributes, and the spatial relationships between objects. In addition, to answer questions that require complex reasoning, the proposed method expresses the scene graph and the external knowledge base based on an ontology and generates an answer that considers both knowledge types together through knowledge reasoning.

3. Visual Experience-Based Question Answering

3.1. Problem Description

The proposed VEQA is a VQA problem about the agent’s visual experience in a 3D simulated environment. The VEQA problem has several assumptions different from the conventional VQA problems. The given environment is only partially observable according to the visible range of the agent, and the environmental state can be changed by executing the agent’s actions. Furthermore, the environmental change caused by the agent’s actions may have some uncertainty. The natural language questions are limited to the visual experience of the agent.

The VEQA problem is expressed by the tuple ⟨A, Q, Y, I, G⟩, where A = {a_1, ..., a_T} is a set of actions to be taken by an agent in a photo-realistic 3D simulated environment, Q = {q_t} is a set of natural language questions, Y = {y_t} is a set of answers, I = {I_t} is a set of observed RGB-D images, and G = {G_t} is a set of scene graphs that represent the estimated environmental states based on the agent's visual experience history until time t. The goal of the VEQA problem is to derive an answer y_t to the given question q_t based on the scene graph G_t.

Specifically, the agent can take one of twelve different actions in the 3D simulated environment: eight agent pose-changing actions, "Move Ahead/Back/Left/Right," "Rotate Left/Right," and "Look Up/Down," and four interaction actions, "Open Object," "Close Object," "Pick Up Object," and "Put Down Object." For the object interaction actions, the target object class is provided together with the action type, such as "Pick up Bread," as shown in Figure 1(b). The natural language questions are largely classified into six types: questions asking the existence of a specific object ("Is there a potato somewhere in the room?"), the number of a specific object ("How many apples are there?"), the attribute of a specific object ("What is the color of the plate?"), the relationship between two objects ("What is the relationship between the pan and the bread?"), the objects included in a certain object ("What things are in the refrigerator?"), and the object held by the agent ("What is the thing the agent has?").

3.2. Scene Graph

A scene graph in the VEQA problem structurally represents an environmental state estimated based on the agent's visual experience history up to time t. A 3D scene graph G_t is composed of a set of nodes N_t and a set of edges E_t. In a scene graph G_t, each node represents an object or an attribute value that an object can have at time t. Consequently, the node set is expressed as N_t = O_t ∪ V, where O_t denotes the set of objects in the environment and V denotes the set of possible attribute values. For example, O_t may contain objects such as "Apple" and "Fridge," and V contains values such as "Red" and "Opened." In this study, each object has two different attributes representing its color and openness. The color attribute of an object can have one of six predefined color values, such as "Red" and "Gray," and the openness attribute of an object, such as a refrigerator or a dresser, can have one of three values: "Opened," "Closed," and "Unable."

Furthermore, an edge in a scene graph represents a 3D spatial relationship between two objects at time t or a specific attribute of one object. Therefore, the edge set is expressed as E_t = R_t ∪ A_t, where R_t denotes the set of spatial relationship edges between two objects and A_t denotes the set of attribute edges predefined for each object class. For example, R_t contains relationship edges such as "On" and "In," and A_t contains attribute edges such as "color" and "openness." In this study, we define 9 different spatial relationship types between 25 different object classes to represent a 3D scene graph: "Left_Of," "Right_Of," "InFront_Of," "Behind_Of," "Over," "Under," "In," "On," and "Has."
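To make these definitions concrete, the following minimal Python sketch shows one possible in-memory representation of a 3D scene graph with object nodes, attribute edges, and spatial relationship edges. The class names (Object3D, SceneGraph) and the object identifiers are illustrative and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Object3D:
    obj_id: str                # e.g., "Apple_1" (illustrative naming)
    obj_class: str             # one of the 25 object classes, e.g., "Apple"
    bbox3d: Tuple[float, ...]  # absolute 3D bounding box (x, y, z, w, h, d)

@dataclass
class SceneGraph:
    t: int                                                # time step
    objects: Dict[str, Object3D] = field(default_factory=dict)
    # attribute edges: (object id, attribute name) -> attribute value
    attributes: Dict[Tuple[str, str], str] = field(default_factory=dict)
    # spatial relationship edges: (subject id, relation, object id)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

# Example: an apple lying on the kitchen table at time t = 3.
g = SceneGraph(t=3)
g.objects["Apple_1"] = Object3D("Apple_1", "Apple", (1.2, 0.9, 0.4, 0.1, 0.1, 0.1))
g.objects["Table_1"] = Object3D("Table_1", "DiningTable", (1.0, 0.8, 0.5, 1.5, 0.1, 0.9))
g.attributes[("Apple_1", "color")] = "Red"
g.attributes[("Table_1", "openness")] = "Unable"
g.relations.append(("Apple_1", "On", "Table_1"))
```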

3.3. Data Collection

The VEQA dataset was collected using the 3D indoor virtual environment AI2-THOR [31]. To collect the VEQA dataset, we first defined 200 action scenarios, each of which includes a series of agent actions to be executed in a different initial configuration of the environment. Each action scenario contains approximately 77 actions on average. Then, we collected input images and the corresponding scene graph data by executing each predefined action scenario in the simulated environment AI2-THOR. The question-answer data was generated semiautomatically based on the scene graphs sampled every 10 agent actions in each action scenario. Consequently, a total of 3,916 scene graphs including 13,109 objects, 26,218 attributes, and 25,583 relationships were built, and 5,397 question-answer pairs of six different types were generated. Table 1 lists the detailed specifications of the VEQA dataset. 80% of the VEQA dataset was used as the training set, 10% as the validation set, and 10% as the test set.

4. System Design

4.1. System Overview

To solve the abovementioned VEQA problem, we propose the VEQA system (VQAS) shown in Figure 3. The proposed VEQA system first generates the scene graph representing the current environmental state based on the observed images and agent actions and then derives a correct answer to the given question through knowledge reasoning on the scene graph. Accordingly, the VQAS consists of two subsystems: a scene graph generation system and a knowledge reasoning system.

The scene graph generation system uses a novel 3D scene graph generation model which can generate a series of 3D scene graphs from successive partial input images in a dynamic environment. The model consists of two different modules: a state recognition module and a state prediction module. The state recognition module expands the previous scene graph into the updated current one by considering the new partial image. On the other hand, the state prediction module predicts the current scene graph by applying the effects of the executed action to the previous scene graph. The generated scene graphs are transformed into a formal state knowledge representation for use in knowledge reasoning.

In the knowledge reasoning system, a correct answer to the question is derived by applying predefined reasoning rules over a set of facts. These facts come from the static background knowledge about the environment as well as the dynamic state knowledge generated on the fly by scene graph generation. The state knowledge corresponds to the environmental state representation, describing the current attributes and spatial relationships of objects. In contrast, the background knowledge base corresponds to a set of facts that are predefined or assumed about the environment. A given natural language question is parsed into a formal query in triple form <subject, predicate, object> for knowledge reasoning using a deep neural network-based semantic parser. The knowledge reasoning system [32] is implemented on top of the SWI-Prolog RDF/OWL inference engine to derive the answer to the query over the abovementioned state and background knowledge facts.

The proposed VEQA system (VQAS) uses a predefined ontology for knowledge representation based on Description Logic (DL). The ontology is composed of various hierarchical classes and their properties, as shown in Figure 4. To express the state knowledge and the background knowledge base, all the object types that can appear in the environment, the object attributes, and the relationships between objects must be predefined using the classes and properties of the ontology. For example, through the IS-A relationship between the "Apple" and "Fruit" classes defined in the ontology, as shown in Figure 4, the prior background knowledge that "Apple" is a type of "Fruit" can be used in answering questions. Furthermore, the state knowledge is expressed as instances of the ontology, each of which is expressed as a triple of subject, predicate, and object.
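As a rough illustration of how the IS-A hierarchy of the ontology can be combined with triple-form state knowledge, the following Python sketch answers a counting question over hypothetical facts. The class hierarchy, fact set, and predicate names are invented for this example; the actual system uses a DL ontology and SWI-Prolog reasoning rather than plain Python.

```python
# Background knowledge: class hierarchy (subclass -> superclass), illustrative only.
IS_A = {
    "Apple": "Fruit",
    "Tomato": "Fruit",
    "Fruit": "Food",
    "Bread": "Food",
}

# State knowledge: triple facts generated from the current scene graph (illustrative).
STATE = [
    ("Apple_1", "rdf:type", "Apple"),
    ("Bread_1", "rdf:type", "Bread"),
    ("Apple_1", "On", "Table_1"),
    ("Bread_1", "On", "Table_1"),
]

def is_subclass_of(cls, ancestor):
    """Follow IS-A links upward to test class membership."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = IS_A.get(cls)
    return False

# "How many fruits are on the table?" -> counts Apple_1 but not Bread_1.
count = sum(
    1
    for (s, p, o) in STATE
    if p == "On" and o == "Table_1"
    and any(s2 == s and p2 == "rdf:type" and is_subclass_of(o2, "Fruit")
            for (s2, p2, o2) in STATE)
)
print(count)  # 1
```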

4.2. Scene Graph Generation

To generate a scene graph of the entire environment from partially observed images, an overall visual understanding of multiple images taken from different viewpoints in the environment is required. Furthermore, the scene graph must be updated and expanded in real time whenever a new image is observed in the VEQA environment. However, simultaneously processing an ever-increasing number of images is expensive and complex.

To solve this problem, we design a scene graph generation model that expands the previous scene graph G_{t-1} into the current scene graph G_t by considering only the newly observed image I_t, as described in the following equations:

G_t^rec = f_SR(G_{t-1}, I_t),  (1)
G_t = f_E(G_{t-1}, G_t^rec),  (2)

where f_SR denotes the state recognition function that builds a recognition-based partial scene graph G_t^rec from the new image and f_E expands the previous scene graph with it.

Our model also considers the dynamics of the environment to generate accurate 3D scene graphs. We can obtain more accurate scene graphs by additionally considering the environmental changes caused by the agent's actions. This can complement wrong recognition results in the partial scene graph G_t^rec constructed from the current image I_t. As shown in the following equations, we obtain the predicted current scene graph G_t^pred, representing the resulting state after executing the action a_t, by applying the corresponding action model to the previous scene graph G_{t-1}. Then, based on the previous scene graph G_{t-1}, the recognition-based scene graph G_t^rec and the prediction-based scene graph G_t^pred are combined into the current scene graph G_t:

G_t^pred = f_SP(G_{t-1}, a_t),  (3)
G_t = f_C(G_{t-1}, G_t^rec, G_t^pred).  (4)

Figure 5 shows the process of generating a scene graph G_t at time t. While the state recognition module in this figure corresponds to the function f_SR used in equation (1), the state prediction module plays the role of the function f_SP used in equation (3). Based on the trained deep neural networks, the state recognition module builds a partial scene graph G_t^rec from a new input image I_t without any consideration of the executed agent action. However, there can be some errors in this partial scene graph due to the challenges of visual recognition. On the other hand, the state prediction module constructs the estimated current scene graph G_t^pred by predicting only the environmental changes caused by the last executed action a_t, without any consideration of the newly observed image I_t. Therefore, there can be some errors in the estimated scene graph due to imperfect action models. To complement the weaknesses of the two modules, the proposed scene graph generation model builds the current scene graph G_t by combining the recognition-based partial scene graph G_t^rec with the prediction-based scene graph G_t^pred, as formulated in equation (4).
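The per-step control flow of equations (1)–(4) can be sketched as follows. The function names (state_recognition, state_prediction, combine) and their interfaces are placeholders for the modules described above, not the paper's actual implementation.

```python
# Sketch of the per-step scene graph update in equations (1)-(4).
def update_scene_graph(prev_graph, image_t, action_t,
                       state_recognition, state_prediction, combine):
    # (1) recognition-based partial scene graph from the new image
    g_rec = state_recognition(prev_graph, image_t)
    # (3) prediction-based scene graph from the action model
    g_pred = state_prediction(prev_graph, action_t)
    # (4) combine both graphs, keeping the previous graph as the base
    return combine(prev_graph, g_rec, g_pred)

def run_episode(initial_graph, images, actions, sr, sp, combine):
    """Maintain a series of 3D scene graphs over an action sequence."""
    graph = initial_graph
    graphs = []
    for image_t, action_t in zip(images, actions):
        graph = update_scene_graph(graph, image_t, action_t, sr, sp, combine)
        graphs.append(graph)
    return graphs
```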

The state recognition module, which extracts a partial scene graph G_t^rec from a new image I_t, consists of four different neural networks: the Object Detection Network (ODN), the Three-dimensional Localization Network (TLN), the Attribute Recognition Network (ARN), and the Relationship Recognition Network (RRN), as shown in Figure 6.

First, to recognize the objects observed in image I_t, 2D object detection and 3D localization are performed sequentially. In 2D object detection, the 2D bounding boxes of objects in the image are determined using the Object Detection Network (ODN) based on YOLOv3 [33], which can detect even small objects. After 2D object detection, the 3D bounding boxes of the detected objects are determined through the Three-dimensional Localization Network (TLN). In many conventional 3D object detection models, the object positions are expressed in relative coordinates centered on the viewpoint of the agent as an observer [34]. However, to construct an invariant 3D scene graph regardless of the agent's position, the object positions should be expressed in absolute coordinates centered on a specific point in the environment.

We design the 3D Localization Network (TLN), shown in Figure 7, to address this problem. The TLN first extracts a relative position feature from the 2D bounding box of an object and the depth image [34]. This feature encodes the position of the object relative to the agent pose. Then, the TLN predicts the 3D bounding box of the object in absolute coordinates by using both the relative position of the object and the agent pose. To improve the prediction accuracy, the TLN also makes use of the object class information extracted by the Object Detection Network (ODN).
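For intuition, the geometric core of the relative-to-absolute conversion can be sketched as below, assuming the agent pose is given as a position plus a yaw angle. The TLN learns this mapping from the depth image, 2D box, agent pose, and object class rather than applying the closed-form transform directly.

```python
import numpy as np

def relative_to_absolute(rel_center, agent_pos, agent_yaw_deg):
    """Convert an object center expressed relative to the agent into absolute
    room coordinates, assuming the agent pose is (x, y, z) plus a yaw angle.
    This is a purely geometric simplification of what the TLN learns."""
    yaw = np.deg2rad(agent_yaw_deg)
    # rotate the relative offset around the vertical (y) axis by the agent yaw
    rot = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                    [0.0,         1.0, 0.0],
                    [-np.sin(yaw), 0.0, np.cos(yaw)]])
    return np.asarray(agent_pos) + rot @ np.asarray(rel_center)

# Example: an object 1 m in front of an agent standing at (2, 0, 3) facing 90 degrees.
print(relative_to_absolute([0.0, 0.5, 1.0], [2.0, 0.0, 3.0], 90.0))
```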

After 3D localization of the objects detected in the image, the attributes of the individual objects are recognized through the Attribute Recognition Network (ARN), as shown in Figure 8. The ARN determines one of the predefined values for each object attribute, such as color and openness. These object attributes can be recognized from visual features extracted from the image through a convolutional neural network (CNN) such as ResNet or VGG. However, we design the ARN to additionally make use of the object class information. The class information of an object often helps to estimate a certain attribute of the object. For example, we can predict that the color of an apple must be red or green even without visual information.

Finally, to detect all possible relationships between objects, the state recognition module generates a set of object pairs from the detected objects. An object pair is composed of a subjective object and an objective object. As the VQAS system assumes a partially observable environment, the relationships between two objects detected independently in different images, as well as the relationships between two objects detected in the same image, must all be recognized. Consequently, for relationship recognition at time t, all possible object pairs are generated using the objects detected in the image I_t at time t and the objects already included in the previous scene graph G_{t-1}.

After all possible object pairs are generated, the spatial relationship for each object pair is determined through the Relationship Recognition Network (RRN), as shown in Figure 9. The RRN receives the 3D bounding boxes of the two objects as an input to recognize the spatial relationship between them. It also receives the class information of the two objects as another input because the allowed relationships between two objects often depend on the classes of the participating objects. For example, "Apple" and "Spoon" cannot have the "In" relationship, but "Apple" and "Fridge" can.
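The following short sketch illustrates one plausible way to enumerate the candidate (subject, object) pairs described above, covering pairs among newly detected objects and pairs between new objects and objects already present in the previous scene graph; the exact pairing strategy used in the paper is not specified beyond this description.

```python
from itertools import product

def candidate_pairs(new_objects, prev_graph_objects):
    """Generate ordered (subject, object) pairs for relationship recognition:
    pairs among the newly detected objects and pairs between new objects and
    objects already present in the previous scene graph."""
    pairs = []
    # pairs within the newly detected objects
    for s, o in product(new_objects, repeat=2):
        if s != o:
            pairs.append((s, o))
    # pairs between new objects and previously known objects (both directions)
    for s, o in product(new_objects, prev_graph_objects):
        pairs.append((s, o))
        pairs.append((o, s))
    return pairs

# Example with illustrative object identifiers only.
print(candidate_pairs(["Spoon_1"], ["Microwave_1", "Table_1"]))
```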

As mentioned above, the state recognition module can only recognize the environmental state from a single image and cannot consider environmental changes that do not appear in the image I_t. For example, in Figure 1(b), when the position of "Bread" and its relationships with other objects are changed through action a_t = "Pick up Bread," it is difficult for the state recognition module to detect this environmental change. Besides, there can be some errors in the resulting partial scene graph due to the challenges of visual recognition. To overcome these problems, the proposed scene graph generation model also makes use of the action-based state prediction module.

For state prediction, the expected effect of environmental change is defined for each agent action as an action model. Figure 10 shows some examples of action models represented in the Planning Domain Definition Language (PDDL) [35]. As the VQAS system assumes an environment with uncertainty in the actions, each potential effect of environmental change is represented with its occurrence probability, as shown in Figure 10. In the state prediction module, the effect of the environmental change caused by the last action a_t is first obtained from the corresponding action model. Then, the effect is applied to the previous scene graph G_{t-1} to generate the predicted scene graph G_t^pred.
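A rough Python approximation of applying such a probabilistic action model is sketched below. The effect structure and the probabilities (0.95/0.05) are illustrative stand-ins for the PDDL models in Figure 10, and only the most probable effect is applied here.

```python
# Illustrative action model for "Pick Up <object>", loosely mirroring the
# PDDL-style models in Figure 10.  Effect probabilities are made up here.
PICKUP_MODEL = {
    "effects": [
        # (probability, add-facts, delete-fact-patterns; "*" means any object)
        (0.95,
         [("Agent", "Has", "{obj}")],
         [("{obj}", "On", "*"), ("{obj}", "In", "*")]),
        (0.05, [], []),   # the action fails and nothing changes
    ]
}

def predict_after_pickup(prev_relations, obj_id, model=PICKUP_MODEL):
    """Apply the most probable effect of the action model to the previous
    relation set to obtain the prediction-based scene graph relations."""
    prob, adds, deletes = max(model["effects"], key=lambda e: e[0])
    relations = [
        (s, p, o) for (s, p, o) in prev_relations
        if not any(s == d_s.format(obj=obj_id) and p == d_p
                   for (d_s, d_p, _) in deletes)
    ]
    relations += [(s.format(obj=obj_id), p, o.format(obj=obj_id)) for (s, p, o) in adds]
    return relations, prob

prev = [("Bread_1", "On", "Table_1")]
print(predict_after_pickup(prev, "Bread_1"))
# -> ([('Agent', 'Has', 'Bread_1')], 0.95)
```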

For more accurate scene graph generation, the proposed system combines the two scene graphs G_t^rec and G_t^pred generated through state recognition and state prediction in a complementary manner. Before the combination, the objects in the two scene graphs are compared to determine whether they are the same object. To that end, the 3D intersection over union (IoU), which indicates the degree of overlap of two objects in 3D space, is computed. If the 3D IoU is greater than 0.3, the two objects are determined to be identical. After all the objects are compared in this way, the objects that exist in only one of the two scene graphs are added to the combined scene graph G_t together with their attributes and relationships. In contrast, the objects that are included in both scene graphs are added to G_t in their original forms only if their attributes and relationships are the same in both scene graphs.

If the object attributes or relationships differ, they are integrated as follows, and the result is added to the scene graph G_t. The two results are combined based on the probability distribution over classes provided by each module. In the case of state recognition, the probability distribution over classes obtained from the recognition network is used. In the case of state prediction, the probabilities that appear in the environmental change effects are used as the probability distribution. The weighted average of these probabilities is computed for each class, and the class with the highest value is selected as the final result:

c* = argmax_c [ w · P_rec(c) + (1 − w) · P_pred(c) ],

where P_rec and P_pred are the probability distributions over classes obtained through recognition and prediction, respectively, and w is the weight ratio of the recognition result. The higher the recognition accuracy, the more advantageous a higher value of w is. In this study, a higher weight was given to the state recognition result. The weighted average method has the advantage of always providing a result that complements the recognition and prediction results. However, its disadvantage is that if the accuracy of the recognition or prediction result is low, the combined result also has low accuracy.
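The following sketch illustrates the two mechanisms just described: an axis-aligned 3D IoU test for deciding whether two detections refer to the same object, and the weighted-average rule of the equation above. The weight w = 0.7 is an assumed value; the paper only states that the recognition result receives the higher weight.

```python
import numpy as np

def iou_3d(box_a, box_b):
    """Axis-aligned 3D IoU for boxes given as (x, y, z, w, h, d),
    where (x, y, z) is the box center."""
    a_min = np.array(box_a[:3]) - np.array(box_a[3:]) / 2
    a_max = np.array(box_a[:3]) + np.array(box_a[3:]) / 2
    b_min = np.array(box_b[:3]) - np.array(box_b[3:]) / 2
    b_max = np.array(box_b[:3]) + np.array(box_b[3:]) / 2
    inter = np.prod(np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None))
    vol_a, vol_b = np.prod(box_a[3:]), np.prod(box_b[3:])
    return inter / (vol_a + vol_b - inter + 1e-9)

def combine_distributions(p_rec, p_pred, w=0.7):
    """Weighted average of recognition and prediction class distributions;
    w = 0.7 is an illustrative weight favoring recognition."""
    classes = set(p_rec) | set(p_pred)
    scores = {c: w * p_rec.get(c, 0.0) + (1 - w) * p_pred.get(c, 0.0) for c in classes}
    return max(scores, key=scores.get)

# Two detections of the same object are merged when their 3D IoU exceeds 0.3.
print(iou_3d((0, 0, 0, 1, 1, 1), (0.2, 0, 0, 1, 1, 1)) > 0.3)   # True
print(combine_distributions({"Opened": 0.6, "Closed": 0.4}, {"Opened": 0.1, "Closed": 0.9}))
```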

The generated scene graphs are transformed into a formal state knowledge representation for use in knowledge reasoning. A knowledge fact representing the environmental state is expressed as a triple of <subject, predicate, object>, and the state knowledge is a set of such facts. A scene graph may include multiple relationship/attribute edges, each of which corresponds to a knowledge fact. Thus, the proposed system transforms each relationship/attribute edge, together with its participating objects, into a knowledge fact expressed as a triple of <subject, predicate, object>. Figure 11 illustrates the transformation of a scene graph into state knowledge based on the predefined ontology.
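A minimal sketch of this scene-graph-to-triple conversion, reusing the SceneGraph structure sketched in Section 3.2, could look as follows; the predicate name "rdf:type" is used here purely for illustration.

```python
def scene_graph_to_triples(graph):
    """Flatten a scene graph (as sketched in Section 3.2) into
    <subject, predicate, object> state-knowledge facts."""
    triples = []
    for obj in graph.objects.values():
        triples.append((obj.obj_id, "rdf:type", obj.obj_class))
    for (obj_id, attr_name), value in graph.attributes.items():
        triples.append((obj_id, attr_name, value))
    for subj, rel, obj in graph.relations:
        triples.append((subj, rel, obj))
    return triples

# Using the example graph g from the earlier sketch, the result would contain
# ("Apple_1", "color", "Red") and ("Apple_1", "On", "Table_1").
```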

4.3. Knowledge Reasoning for QA

Knowledge reasoning is a technique that has long been used in traditional artificial intelligence. It can be used to derive new facts from known facts or to correctly deduce answers to a given question based on predefined reasoning rules [32]. This technique has many advantages, such as an explainable inference process and simple incorporation of large prebuilt background knowledge. To take advantage of these strengths, we use a knowledge reasoning system for scene graph-based question answering in our VQAS system.

This section describes the knowledge reasoning process for visual experience-based question answering (VEQA) with the state knowledge and a background knowledge base, as shown in Figure 12. In the knowledge reasoning process, both the state and the background knowledge are assumed to be built based on the same context ontology. When a natural language question about an environmental state is given, it is first transformed to a formal query for knowledge reasoning. Then, an answer to the question is derived by performing knowledge reasoning over the knowledge source combining the corresponding state knowledge and the background knowledge, starting from the formal query.

As shown in Figure 12, the state knowledge is generated in real time as the agent performs an action a_t and receives an input image I_t. Each time the state knowledge is generated, the perception handler stores and updates it in the working memory. Furthermore, the background knowledge is loaded from the background knowledge base when the system starts. Unlike the state knowledge, the background knowledge remains constant in the working memory from system initiation because it does not change in the given environment. The two different knowledge types stored in the working memory are used as the knowledge source for knowledge reasoning to answer a given question.

For knowledge reasoning, a natural language question is parsed into a formal query by a trained Question Parsing Network (QPN). A formal query is expressed as a triple of <subject, predicate, object> representing a natural language question. For example, the natural language question “How many fruits are in the room?” in Figure 12 can be changed to the triple query, <Fruit, NumberOfObject, ?>.

The proposed system transforms a natural language question into a triple query by using the Question Parsing Network (QPN), as shown in Figure 13. A natural language question is a sequence of words. Consequently, the features of the question are extracted by a bidirectional long short-term memory (BiLSTM) recurrent neural network, which can process sequence data. The subject, predicate, and object comprising the triple query are then predicted from the extracted features. This QPN is used as the question parser of the query processor in Figure 12.
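A hedged PyTorch sketch of such a QPN is given below: a BiLSTM encoder followed by three classification heads for the subject, predicate, and object of the triple query. All layer sizes, vocabularies, and the pooling choice are assumptions, not the paper's actual settings.

```python
import torch
import torch.nn as nn

class QuestionParsingNetwork(nn.Module):
    """Sketch of a QPN: a BiLSTM encoder with three classification heads that
    predict the subject, predicate, and object of the triple query."""
    def __init__(self, vocab_size, n_subjects, n_predicates, n_objects,
                 emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.subj_head = nn.Linear(2 * hidden_dim, n_subjects)
        self.pred_head = nn.Linear(2 * hidden_dim, n_predicates)
        self.obj_head = nn.Linear(2 * hidden_dim, n_objects)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)    # (B, L, emb_dim)
        enc, _ = self.encoder(emb)         # (B, L, 2 * hidden_dim)
        feat = enc.mean(dim=1)             # simple pooling over the words
        return self.subj_head(feat), self.pred_head(feat), self.obj_head(feat)

# Example: a batch containing one tokenized question of length 8.
qpn = QuestionParsingNetwork(vocab_size=1000, n_subjects=30, n_predicates=10, n_objects=40)
subj_logits, pred_logits, obj_logits = qpn(torch.randint(0, 1000, (1, 8)))
```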

In query processing, the natural language question is first transformed into a triple query by the question parser. Then, the triple query is expressed as a SWI-Prolog [36] query for the predefined reasoner by using the query translator. Next, the query executor infers the answer to the question by querying the working memory or by invoking the reasoner. The reasoner generates new knowledge from the knowledge stored in the working memory by using predefined reasoning rules. Figure 14 illustrates the reasoning rules defined in SWI-Prolog; these comprise other predefined predicates, each of which obtains a fact satisfying its condition by querying the triple knowledge in the working memory and the class layer of the ontology. These reasoning rules are defined one by one for each query type so that the correct answer can be generated for every given query. Lastly, the knowledge obtained through the reasoning rules is transformed by the answer generator from the SWI-Prolog format into a natural language format that can be understood by humans.
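The following Python sketch mimics the query-processing idea in a simplified form: a triple query is dispatched to a per-predicate reasoning routine over the working-memory triples, and the result is rendered as a natural language answer. The predicate names and answer templates are invented for this illustration; the actual system uses SWI-Prolog reasoning rules.

```python
def count_objects(working_memory, target_class, is_subclass_of):
    """Count objects whose type is the target class or one of its subclasses."""
    return sum(1 for (s, p, o) in working_memory
               if p == "rdf:type" and is_subclass_of(o, target_class))

def answer_query(query, working_memory, is_subclass_of):
    """Dispatch a triple query to a reasoning routine and render an answer."""
    subject, predicate, _ = query
    if predicate == "NumberOfObject":
        n = count_objects(working_memory, subject, is_subclass_of)
        return f"There are {n}."
    if predicate == "ExistenceOfObject":
        n = count_objects(working_memory, subject, is_subclass_of)
        return "Yes." if n > 0 else "No."
    raise ValueError(f"No reasoning rule defined for predicate {predicate}")

# answer_query(("Fruit", "NumberOfObject", "?"), STATE, is_subclass_of)
# would count Apple_1 as a fruit using the IS-A hierarchy sketched earlier.
```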

5. Implementation and Evaluation

5.1. Model Training

To implement the proposed system, the deep neural network modules comprising the system were trained independently. These modules included the TLN, ARN, and RRN for 3D scene graph generation and the QPN for question answering. First, as the loss function for training, the TLN used the mean absolute error:

L_TLN = (1/N) Σ_{i=1}^{N} |b_i − b̂_i|,

where b̂_i and b_i denote the predicted and ground-truth 3D bounding boxes, respectively. For the other deep neural network modules, the following cross-entropy error was used:

L_CE = −Σ_c y_c log(p_c),

where y_c denotes the one-hot encoding value for the ground-truth answer class and p_c denotes the probability for each class predicted by the model.

Furthermore, each network was trained using the Adam optimizer with a learning rate of 0.001. We also used a step decay schedule that decreases the learning rate by a factor of 10 every 10 epochs. Because the TLN and ARN use the recognition results of the 2D object detection network (ODN), they receive 2D bounding boxes that contain small errors. Therefore, in this study, random noise reflecting this error range was added to the ground-truth 2D bounding boxes when training the TLN and ARN. In addition, each neural network was implemented with PyTorch, a Python deep learning library, and trained on a GeForce GTX TITAN X GPU.
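A minimal PyTorch sketch of this training setup (Adam with learning rate 0.001, a step decay dividing the rate by 10 every 10 epochs, and L1 or cross-entropy losses) is shown below; the helper function and the epoch loop are illustrative, not the paper's training code.

```python
import torch
import torch.nn as nn

def make_training_setup(model, task="classification"):
    """Optimizer and schedule as described in the text: Adam with lr = 0.001
    and a step decay that divides the learning rate by 10 every 10 epochs.
    The loss choice mirrors the paper: L1 (mean absolute error) for the TLN
    regression and cross entropy for the classification networks."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    criterion = nn.L1Loss() if task == "regression" else nn.CrossEntropyLoss()
    return optimizer, scheduler, criterion

# Typical epoch loop (data loading omitted):
# for epoch in range(num_epochs):
#     for x, y in loader:
#         optimizer.zero_grad()
#         loss = criterion(model(x), y)
#         loss.backward()
#         optimizer.step()
#     scheduler.step()
```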

6. Experiments

To evaluate the performance of the proposed VQAS system, we conduct several experiments with the VEQA benchmark dataset.

(1) The first experiment evaluates the question answering performance of the proposed system. In this experiment, different VQAS configurations with ground truth data are compared with each other: VQAS (without any ground truth data), VQAS with ground truth 2D objects, VQAS with ground truth scene graphs, and VQAS with ground truth queries. Table 2 shows the experimental results classified into the six types of VEQA questions. The results show that the proposed VQAS system without any ground truth data achieved a high performance of 72.37% on average over all question types. The system performance when the ground truth class and 2D bounding box of the objects are directly given (VQAS with GT 2D objects) differs from that of VQAS by about 11%. This shows that the 2D object detection performance significantly affects the visual experience-based question answering performance, because the object detection result affects attribute recognition, 3D localization, and relationship recognition. Next, the system performance when the ground truth scene graph is used for question answering (VQAS with GT scene graphs) is close to 100%, implying that the VEQA problem can be solved sufficiently through correct scene graph generation.

Furthermore, the result obtained when the ground truth query was used without the QPN (VQAS with GT query) shows no significant difference from that of the original VQAS. This suggests that the QPN can predict the correct query from the natural language questions. The overall results also show that the performances for the existence and counting question types are the highest, because the proposed system can explicitly represent objects in the environment and infer the accurate answer based on explicit knowledge. The question type that showed the lowest performance was Include. To answer an Include question, the proposed system requires correct object detection and relationship recognition; moreover, all objects with an In relationship with a specific object must be correctly recognized. Consequently, the Include question type showed a lower performance than the other question types.

(2) The second experiment evaluates the scene graph generation performance of the proposed VQAS system with different state estimation methods. In this experiment, two state estimation methods are compared: VQAS without SP (without state prediction) and VQAS (with action model-based state prediction). Both methods share the same state recognition module to estimate the current environmental state. Two performance measures are used: Object mAP for 3D object detection and SGGen for scene graph generation. Object mAP represents the mean Average Precision over objects whose class is the same as that of the ground truth object and whose 3D IoU (Intersection over Union) with the ground truth is greater than 0.3. SGGen represents the recall of the triples comprising the ground truth scene graph; it is evaluated separately for attribute triples and relationship triples, depending on the predicate type of the triple. Each predicted triple is counted as correct when the object class of the triple matches that of the ground truth object with a 3D IoU of more than 0.3, and its relationship or attribute class also matches the ground truth (a simplified sketch of this matching rule is given after the experiment list below).

The experimental result in Table 3 confirmed that the use of the action model-based state prediction helps improve the performance of 3D object detection and that of scene graph generation. Regarding object detection, the action model allows the accurate prediction of the object’s positional changes. For example, when the agent picks up an object or moves while holding an object, it is difficult to determine the positional changes of the object in the images. In contrast, the moved position of the object can be predicted through the action model. For scene graph generation, both the performances for attributes and relationships improved. This suggests that the state prediction result using action models and the state recognition result generated through visual recognition can be combined in a complementary manner.
(3) The third experiment evaluates the scene graph generation performance of the proposed VQAS system with different state recognition configurations. In this experiment, four configurations are compared, in which the deep neural network modules for 2D object detection, attribute recognition, 3D localization, and relationship recognition are selectively replaced by the corresponding ground truth. Table 4 shows the results of this experiment. The proposed state recognition model using all of the recognition networks shows an SGGen performance of 53.73% and can thus generate scene graphs above a certain level of accuracy. However, the more recognition networks are used instead of ground truth, the lower the performance is. This seems to be because the recognition error of each network has a negative effect on the recognition of the subsequent networks. In particular, when the configuration using the object detection network is compared with the configuration using ground truth object detection, both mAP and SGGen decrease when the objects are automatically detected in the image by the object detection network. This is because the performance of the object detection network affects all the other recognition networks that use its result. In other words, if the objects in the input image cannot be found accurately, the accuracy of the scene graph representing the object attributes and their spatial relationships inevitably decreases.
(4) The fourth experiment evaluates the 3D localization performance of the TLN with different input information. In this experiment, three different input combinations for the TLN are compared: (Depth Image + 2D Bbox), (Depth Image + 2D Bbox + Agent Pose), and (Depth Image + 2D Bbox + Agent Pose + Object Class). Table 5 shows the results. The position columns indicate <x, y, z>, corresponding to the position of the predicted 3D bounding box, and the size columns indicate <w, h, d>, corresponding to its size. The mean absolute error with respect to the ground truth was used as the error measure, and the accuracy indicates the ratio of predictions whose 3D IoU with the ground truth is greater than 0.3. The results in Table 5 show that the performance is very low when only the depth image and 2D bounding box are used, because both inputs represent only the relative position of the object from the agent's perspective. In contrast, when the agent pose information is used together with Depth Image + 2D Bbox, the performance greatly improves compared to the configuration without the agent pose. This is because the agent pose provides the information needed to recover the absolute position from the relative position of the object. Furthermore, the performance improves further when the object class is also used (Depth Image + 2D Bbox + Agent Pose + Object Class). This is because the object class indicates the typical size of the object; for example, an "Apple" is small and a "Fridge" is large.
(5) The fifth experiment evaluates the attribute recognition performance of the ARN with different input information. In this experiment, three different input combinations for the ARN are compared: (Image + 2D Bbox), (Object Class), and (Image + 2D Bbox + Object Class). Table 6 shows the results. First, when only the image and the 2D bounding boxes of objects are used for attribute recognition (Image + 2D Bbox), both color and openness show performances higher than 80%. In contrast, when only the object class is used (Object Class), only the openness performance is high. Similar to the results in Table 5, the object class can be used to predict the typical attribute value of each object class. For example, "Apple" has an attribute showing that it cannot be opened or closed, whereas "Fridge" can have the "Opened" or "Closed" value. The highest performance is obtained when the image, 2D bounding box, and object class are used together (Image + 2D Bbox + Object Class), because the advantages of each input described above contribute independently to the performance.
(6) The sixth experiment evaluates the relationship recognition performance of the RRN with different input information. In this experiment, three different input combinations for the RRN are compared: (3D Bbox), (Object Class), and (3D Bbox + Object Class). Table 7 shows the results. These results confirm that spatial relationships can be recognized even when only the 3D bounding boxes of the two objects are used (3D Bbox), because the spatial relationship between two objects can be estimated to some degree by comparing their positions in space. When only the object class is used (Object Class), the performance is lower than when only the 3D bounding boxes are used (3D Bbox). However, the result shows that spatial relationships can still be recognized to some degree from the object classes alone, because the possible spatial relationships are constrained by the object classes. For example, "Bread" and "Apple" can have spatial relationships other than "In" and "On," whereas the "In" relationship can be expected between "Egg" and "Fridge." When the two pieces of information are used together (3D Bbox + Object Class), the highest performance is obtained, implying that the two inputs independently help to improve the performance.
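As referenced in experiment (2), the triple-matching rule behind the SGGen measure can be sketched as follows. The object-id naming convention ("Class_index") and the helper iou_3d (like the one sketched in Section 4.2) are assumptions of this illustration.

```python
def triple_matches(pred, gt, pred_boxes, gt_boxes, iou_3d, thr=0.3):
    """A predicted triple matches a ground-truth triple when the predicate and
    the participating object classes agree and the corresponding 3D boxes
    overlap with IoU > thr.  Attribute values (e.g., "Red") have no box."""
    (ps, pp, po), (gs, gp, go) = pred, gt
    if pp != gp or ps.split("_")[0] != gs.split("_")[0]:
        return False
    if ps not in pred_boxes or gs not in gt_boxes:
        return False
    if iou_3d(pred_boxes[ps], gt_boxes[gs]) <= thr:
        return False
    if go in gt_boxes:   # relationship triple: the object side is another object
        return (po in pred_boxes and po.split("_")[0] == go.split("_")[0]
                and iou_3d(pred_boxes[po], gt_boxes[go]) > thr)
    return po == go      # attribute triple: the object side is an attribute value

def sggen_recall(pred_triples, gt_triples, pred_boxes, gt_boxes, iou_3d, thr=0.3):
    """Recall of ground-truth triples recovered by the generated scene graph."""
    hit = sum(any(triple_matches(p, g, pred_boxes, gt_boxes, iou_3d, thr)
                  for p in pred_triples) for g in gt_triples)
    return hit / max(len(gt_triples), 1)
```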

Finally, Figure 15 shows a sequence of examples for the qualitative performance analysis of the proposed VQAS system. The figures on the left side show the image I_t observed at time t before the agent performs an action, the detected objects, and the generated scene graph G_t. The figures on the right side show the image I_{t+1} observed at time t+1 after the action is performed, the detected objects, and the generated scene graph G_{t+1}.

Figure 15(a) illustrates the scene graphs G_t and G_{t+1} generated before and after executing "Rotate Right," an action in which the agent changes its pose. An inspection of the scene graph G_{t+1} reveals that the relationship between "Microwave" and "Spoon" is recognized as "Right_Of," even though "Microwave" is not observed in image I_{t+1} and "Spoon" is only newly observed in image I_{t+1}. This confirms that the proposed VQAS system can generate scene graphs properly even in a partially observable environment.

Figures 15(b) and 15(c) illustrate the scene graphs generated before and after executing the "Pick Up" and "Open" actions, in which the agent directly changes the environmental state. The scene graph in Figure 15(b) shows the proper prediction of the "Has" relationship between "Agent" and "Bread." The scene graph in Figure 15(c) confirms that the openness of "Fridge" was updated and the newly observed objects were correctly recognized after performing the "Open" action.

However, in the case of Figure 15(d), a wrong scene graph is generated owing to recognition errors. In this example, the 3D positions of the "Fridge" appearing in the two images were recognized differently between images I_t and I_{t+1}. In addition, the same object was misclassified as "Bowl" in image I_t and as "Pan" in image I_{t+1}; this error caused the generation of a wrong scene graph. As a result, two "Fridge" nodes and both "Bowl" and "Pan" appear in scene graph G_{t+1}. Thus, the recognition error of each recognition network has a negative effect on the scene graph. To alleviate this problem, future studies should devise methods to improve the performance of each recognition network and to correct recognition errors and the erroneous scene graphs derived from them.

The table on the right side of Figure 15 shows the triple query predicted from the natural language question given at time t+1 and the answer generated through knowledge reasoning. Figures 15(a) and 15(b) show the correct prediction of the triple query and the generation of a correct answer. However, Figure 15(c) shows the correct prediction of the triple query but the generation of a wrong answer due to erroneous scene graph generation. This shows that the proposed system can generate a wrong answer when errors appear in the scene graph, as in the case of Figure 15(c), because it performs knowledge reasoning based on the state knowledge derived from the scene graph. Therefore, the proposed system requires correct scene graph generation for correct answer generation. In the case of Figure 15(d), even though the correct scene graph was generated, the triple query was predicted incorrectly, so reasoning was performed with a query whose meaning differed from that of the given question. These examples suggest that both the accuracy of scene graph generation and the accuracy of question parsing affect answer generation.

7. Conclusion

This paper proposed a novel VEQA problem and the corresponding dataset for embodied intelligence research, which requires an agent to perform actions, understand entire 3D scenes from successive partial input images, and answer natural language questions about the dynamic scenes in a complex multimodal environment. To address the VEQA problem, we proposed a hybrid visual question answering system, VQAS, integrating a deep neural network-based scene graph generation model and a rule-based knowledge reasoning system. Furthermore, we proposed a novel 3D scene graph generation model which can generate a series of 3D scene graphs from successive partial input images in a dynamic environment. The proposed model overcomes the limitation of conventional scene graph generation models that build just a 2D scene graph from a single still image, and it handles well the partial observability and dynamics of the VEQA environment. We also proposed a knowledge reasoning system to answer natural language questions based on 3D scene graphs. Different from pure deep neural network-based models, the proposed knowledge reasoning system can use a rich knowledge source to answer questions by combining the shallow knowledge in 3D scene graphs with a large amount of prebuilt deep background knowledge.

In this study, a series of experiments using AI2-THOR and the VEQA benchmark dataset were performed to analyze the performance and limitation of the proposed VQAS system. The results of these experiments verified the usefulness of the VEQA problem and the high performance of the proposed VQAS system. However, a few limitations of the proposed system were also found, such as the possibility that the state knowledge may not express the observed environmental state perfectly or may generate a wrong answer when the estimated state is different from that of the actual environment. To improve these limitations, the generation of more accurate state knowledge will be researched in the future.

Data Availability

The VEQA data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by Kyonggi University Research Grant 2019.