Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2020-01-20. DOI: arxiv-2001.07059. Authors: Moshiur R. Farazi, Salman H. Khan, Nick Barnes
Visual Question Answering (VQA) has emerged as a Visual Turing Test to
validate the reasoning ability of AI agents. The pivot of existing VQA models
is the joint embedding learned by combining the visual features from an
image with the semantic features from a given question. Consequently, a large
body of literature has focused on developing complex joint embedding strategies
coupled with visual attention mechanisms to effectively capture the interplay
between these two modalities. However, modelling the visual and semantic
features in a high dimensional (joint embedding) space is computationally
expensive, and more complex models often result in trivial improvements in the
VQA accuracy. In this work, we systematically study the trade-off between the
model complexity and the performance on the VQA task. VQA models have a diverse
architecture comprising pre-processing, feature extraction, multimodal
fusion, attention and final classification stages. We specifically focus on the
effect of "multi-modal fusion" in VQA models, which is typically the most
expensive step in a VQA pipeline. Our thorough experimental evaluation leads us
to two proposals, one optimized for minimal complexity and the other one
optimized for state-of-the-art VQA performance.
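To make the complexity trade-off concrete, the sketch below contrasts a cheap multimodal fusion (an element-wise Hadamard product of projected features) with the parameter count of a full bilinear fusion. This is a minimal illustration with toy dimensions chosen here for illustration; the paper's actual models and dimensions may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (illustrative only, not from the paper).
d_v, d_q, d_joint = 2048, 1024, 512  # image, question, joint-embedding dims

v = rng.standard_normal(d_v)  # visual feature (e.g. from a CNN)
q = rng.standard_normal(d_q)  # semantic feature (e.g. from a question encoder)

# Linear projections into a shared space (weights would normally be learned).
W_v = rng.standard_normal((d_joint, d_v)) * 0.01
W_q = rng.standard_normal((d_joint, d_q)) * 0.01

# Cheap fusion: element-wise (Hadamard) product of the projected features.
joint = (W_v @ v) * (W_q @ q)
assert joint.shape == (d_joint,)

# Parameter counts illustrate the complexity gap the paper studies:
hadamard_params = d_joint * d_v + d_joint * d_q  # two projection matrices
bilinear_params = d_joint * d_v * d_q            # full bilinear tensor
print(hadamard_params, bilinear_params)
```

With these dimensions the bilinear tensor holds several hundred times more parameters than the two projections, which is why fusion dominates the cost of a VQA pipeline and why cheaper approximations often trade little accuracy for a large complexity saving.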
Updated: 2020-01-22