Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2023-03-18, DOI: arxiv-2303.10482
Shi Chen, Qi Zhao

Humans have the innate capability to answer diverse questions, which is rooted in the natural ability to correlate different concepts based on their semantic relationships and to decompose difficult problems into sub-tasks. In contrast, existing visual reasoning methods assume training samples that capture every possible object and reasoning problem, and rely on black-box models that commonly exploit statistical priors. They have yet to develop the capability to handle novel objects or spurious biases in real-world scenarios, and they also fall short of explaining the rationales behind their decisions. Inspired by how humans reason about the visual world, we tackle these challenges from a compositional perspective and propose an integral framework consisting of a principled object factorization method and a novel neural module network. Our factorization method decomposes objects based on their key characteristics and automatically derives prototypes that represent a wide range of objects. With these prototypes encoding important semantics, the proposed network then correlates objects by measuring their similarity in a common semantic space and makes decisions through a compositional reasoning process. It is capable of answering questions about diverse objects regardless of their availability during training, and of overcoming the issues caused by biased question-answer distributions. In addition to the enhanced generalizability, our framework also provides an interpretable interface for understanding the decision-making process of models. Our code is available at https://github.com/szzexpoi/POEM.
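
As a rough illustration of what "measuring similarity in a common semantic space" could look like, the sketch below projects two object feature vectors onto a small set of prototypes and compares the resulting prototype weights. It is not the authors' POEM implementation: the function names, the hand-written prototypes, the use of cosine similarity, and the softmax temperature are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prototype_weights(feature, prototypes, temperature=0.1):
    """Softmax over cosine similarities between one feature and every prototype.

    Hypothetical stand-in for expressing an object in a shared prototype space;
    the temperature value is an arbitrary choice for this sketch.
    """
    sims = [cosine(feature, p) for p in prototypes]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def object_similarity(feat_a, feat_b, prototypes):
    """Compare two objects by their prototype weights rather than their raw features."""
    return cosine(
        prototype_weights(feat_a, prototypes),
        prototype_weights(feat_b, prototypes),
    )


# Toy prototypes (assumed, not learned): one axis per "key characteristic".
PROTOTYPES = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

seen_object = [0.9, 0.1, 0.0]
novel_object = [0.8, 0.2, 0.1]   # never "seen", but shares the dominant characteristic
unrelated_object = [0.0, 0.1, 0.9]

print(object_similarity(seen_object, novel_object, PROTOTYPES))
print(object_similarity(seen_object, unrelated_object, PROTOTYPES))
```

In this toy setup an unseen object that shares its dominant characteristic with a seen object ends up close to it in prototype space, which is the intuition behind handling objects that were unavailable during training. In the actual framework the prototypes are derived automatically by the factorization method.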

Updated: 2023-03-22