Self-Adaptive Neural Module Transformer for Visual Question Answering
IEEE Transactions on Multimedia (IF 7.3), Pub Date: 2020-05-18, DOI: 10.1109/tmm.2020.2995278
Zhong Huasong, Jingyuan Chen, Chen Shen, Hanwang Zhang, Jianqiang Huang, Xian-Sheng Hua

Vision and language understanding is one of the most fundamental and difficult tasks in Multimedia Intelligence. Among these tasks, Visual Question Answering (VQA) is especially challenging since it requires complex reasoning steps to reach the correct answer. To achieve this, Neural Module Network (NMN) and its variants rely on parsing the natural language question into a module layout (i.e., a problem-solving program). In particular, this process follows a feedforward encoder-decoder pipeline: the encoder embeds the question into a static vector, and the decoder generates the layout. However, we argue that such a conventional encoder-decoder pipeline neglects both the dynamic nature of question comprehension (i.e., we should attend to different words from step to step) and the per-module intermediate results (i.e., we should discard modules that perform badly) during the reasoning steps. In this paper, we present a novel NMN, called Self-Adaptive Neural Module Transformer (SANMT), which adaptively adjusts both the question feature encoding and the layout decoding by considering intermediate Q&A results. Specifically, we encode the intermediate results together with the given question features by a novel transformer module to generate a dynamic question feature embedding that evolves over the reasoning steps. Besides, the transformer utilizes the intermediate results from each reasoning step to guide the subsequent layout arrangement. Extensive experimental evaluations demonstrate the superiority of the proposed SANMT over NMN and its variants on four challenging benchmarks, including CLEVR, CLEVR-CoGenT, VQAv1.0, and VQAv2.0 (on average, the relative improvements over NMN are 1.5, 2.3, 0.7, and 0.5 points in accuracy, respectively).
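The abstract does not include an implementation, but the step-wise mechanism it describes (an embedding of the previous step's intermediate result attending over the question word features to produce a dynamic question encoding, which then guides the choice of the next module in the layout) can be illustrated with a minimal PyTorch sketch. All names here (SANMTDecoderStep, num_modules, and so on) are our own illustrative assumptions, not the authors' code:

```python
# A minimal, self-contained sketch of one SANMT-style reasoning step,
# under our own naming and sizing assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SANMTDecoderStep(nn.Module):
    """One reasoning step: the previous intermediate result re-attends over
    the question words (dynamic question encoding), and the re-encoded
    question guides the choice of the next module in the layout."""

    def __init__(self, d_model: int, num_modules: int, num_heads: int = 4):
        super().__init__()
        # Cross-attention: intermediate result as query, question words as keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Predicts a distribution over candidate neural modules for this step.
        self.module_head = nn.Linear(d_model, num_modules)

    def forward(self, question_words, intermediate):
        # question_words: (batch, num_words, d_model) -- per-word question features
        # intermediate:   (batch, 1, d_model)         -- embedding of the previous
        #                                                step's intermediate result
        # Dynamic question embedding: attend to different words at each step,
        # conditioned on what the previous module produced.
        attended, _ = self.cross_attn(intermediate, question_words, question_words)
        dynamic_q = self.norm(attended + intermediate)
        # Use the dynamic embedding to pick the next module in the layout.
        module_logits = self.module_head(dynamic_q.squeeze(1))
        return dynamic_q, module_logits

if __name__ == "__main__":
    batch, num_words, d_model, num_modules = 2, 10, 64, 8
    step = SANMTDecoderStep(d_model, num_modules)
    q = torch.randn(batch, num_words, d_model)
    r = torch.randn(batch, 1, d_model)  # stand-in for a module's output embedding
    dyn_q, logits = step(q, r)
    print(dyn_q.shape, F.softmax(logits, dim=-1).shape)  # (2, 1, 64) (2, 8)
```

Unrolling such a step over several iterations, feeding each module's output embedding back in as the next query, would yield a question encoding that evolves over the reasoning steps, as the abstract describes; the exact module set and feedback mechanism in the paper may differ.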

Updated: 2020-05-18