iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability
arXiv - CS - Multimedia. Pub Date: 2021-06-25. DOI: arxiv-2107.10300
Aman Chadha, Vinija Jain

Causality knowledge is vital to building robust AI systems. Deep learning models often perform poorly on tasks that require causal reasoning, which typically relies on some form of commonsense knowledge that is not immediately available in the input but is implicitly inferred by humans. Prior work has exposed the spurious observational biases that models fall prey to in the absence of causality. While language representation models preserve contextual knowledge within learned embeddings, they do not factor in causal relationships during training. By blending causal relationships into the input features of an existing model that performs visual cognition tasks (such as scene understanding, video captioning, and video question-answering), better performance can be achieved owing to the insight that causal relationships provide. Recently, several models have been proposed that tackle the task of mining causal data from either the visual or the textual modality. However, little research exists that mines causal relationships by juxtaposing the visual and language modalities. While images offer a rich, easy-to-process resource from which to mine causality knowledge, videos are denser and consist of naturally time-ordered events. Moreover, textual information offers details that may remain implicit in videos. We propose iReason, a framework that infers visual-semantic commonsense knowledge using both videos and natural language captions. Furthermore, iReason's architecture integrates a causal rationalization module to aid interpretability, error analysis, and bias detection. We demonstrate the effectiveness of iReason through a two-pronged comparative analysis against language representation learning models (BERT, GPT-2) as well as current state-of-the-art multimodal causality models.
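The abstract's central mechanism is blending causal relationships into the input features of an existing visual-cognition model. As an illustration only, and not the authors' actual architecture, the PyTorch sketch below shows one simple way such a fusion could be wired: concatenating a hypothetical causal-knowledge embedding with video and caption features before a shared task head. The module name, feature dimensions, and fusion strategy are all assumptions made for the example.

import torch
import torch.nn as nn

class CausalFeatureFusion(nn.Module):
    """Hypothetical fusion of a causal-knowledge embedding with video
    and text features; not the iReason architecture itself."""

    def __init__(self, video_dim=2048, text_dim=768, causal_dim=256, out_dim=512):
        super().__init__()
        # Project the concatenated features into a joint space for a task head.
        self.fuse = nn.Sequential(
            nn.Linear(video_dim + text_dim + causal_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, video_feats, text_feats, causal_feats):
        # Blend causal relationships with the ordinary input features
        # by simple concatenation along the feature axis.
        fused = torch.cat([video_feats, text_feats, causal_feats], dim=-1)
        return self.fuse(fused)

# Usage with a dummy batch of 4 examples (all shapes assumed).
fusion = CausalFeatureFusion()
out = fusion(torch.randn(4, 2048), torch.randn(4, 768), torch.randn(4, 256))
print(out.shape)  # torch.Size([4, 512])

Concatenation is the simplest possible fusion choice here; attention-based or gated fusion would be equally plausible readings of "blending", and the paper itself should be consulted for the method actually used.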

Updated: 2021-07-23