Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2020-01-20. DOI: arxiv-2001.07059. Authors: Moshiur R. Farazi, Salman H. Khan, Nick Barnes
Visual Question Answering (VQA) has emerged as a Visual Turing Test to
validate the reasoning ability of AI agents. The pivot of existing VQA models
is the joint embedding learned by combining the visual features from an
image with the semantic features from a given question. Consequently, a large
body of literature has focused on developing complex joint embedding strategies
coupled with visual attention mechanisms to effectively capture the interplay
between these two modalities. However, modelling the visual and semantic
features in a high dimensional (joint embedding) space is computationally
expensive, and more complex models often result in trivial improvements in the
VQA accuracy. In this work, we systematically study the trade-off between the
model complexity and the performance on the VQA task. VQA models have a diverse
architecture comprising pre-processing, feature extraction, multimodal
fusion, attention and final classification stages. We specifically focus on the
effect of "multi-modal fusion" in VQA models, which is typically the most
expensive step in a VQA pipeline. Our thorough experimental evaluation leads us
to two proposals, one optimized for minimal complexity and the other one
optimized for state-of-the-art VQA performance.
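To make the complexity trade-off concrete, the sketch below contrasts a cheap multimodal fusion (an element-wise Hadamard product of projected features) with the parameter count of a full bilinear fusion. This is a minimal illustration with toy dimensions chosen here for illustration; the paper's actual models and dimensions may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (illustrative only, not from the paper).
d_v, d_q, d_joint = 2048, 1024, 512  # image, question, joint-embedding dims

v = rng.standard_normal(d_v)  # visual feature (e.g. from a CNN)
q = rng.standard_normal(d_q)  # semantic feature (e.g. from a question encoder)

# Linear projections into a shared space (weights would normally be learned).
W_v = rng.standard_normal((d_joint, d_v)) * 0.01
W_q = rng.standard_normal((d_joint, d_q)) * 0.01

# Cheap fusion: element-wise (Hadamard) product of the projected features.
joint = (W_v @ v) * (W_q @ q)
assert joint.shape == (d_joint,)

# Parameter counts illustrate the complexity gap the paper studies:
hadamard_params = d_joint * d_v + d_joint * d_q  # two projection matrices
bilinear_params = d_joint * d_v * d_q            # full bilinear tensor
print(hadamard_params, bilinear_params)
```

With these dimensions the bilinear tensor holds several hundred times more parameters than the two projections, which is why fusion dominates the cost of a VQA pipeline and why cheaper approximations often trade little accuracy for a large complexity saving.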
Updated: 2020-01-22