Where and When: Space-Time Attention for Audio-Visual Explanations
arXiv - CS - Artificial Intelligence | Pub Date: 2021-05-04 | DOI: arxiv-2105.01517
Yanbei Chen, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

Explaining the decision of a multi-modal decision-maker requires determining the evidence from both modalities. Recent advances in XAI provide explanations for models trained on still images. However, when it comes to modeling multiple sensory modalities in a dynamic world, it remains underexplored how to demystify the dynamics of a complex multi-modal model. In this work, we take a crucial step forward and explore learnable explanations for audio-visual recognition. Specifically, we propose a novel space-time attention network that uncovers the synergistic dynamics of audio and visual data over both space and time. Our model is capable of predicting audio-visual video events, while justifying its decision by localizing where the relevant visual cues appear and when the predicted sounds occur in videos. We benchmark our model on three audio-visual video event datasets, comparing extensively to multiple recent multi-modal representation learners and intrinsic explanation models. Experimental results demonstrate the clearly superior performance of our model over existing methods on audio-visual video event recognition. Moreover, we conduct an in-depth study of our model's explainability, based on robustness analysis via perturbation tests and on pointing games using human annotations.
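The abstract names a space-time attention network that answers "where" (relevant spatial visual cues) and "when" (temporal occurrence of the predicted sounds). As a rough illustration only, below is a minimal PyTorch sketch of that idea; the module names, feature shapes, fusion scheme, and class count are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class SpaceTimeAVAttention(nn.Module):
    """Illustrative sketch: spatial attention over frame locations ("where")
    plus temporal cross-attention from audio to visual frames ("when")."""

    def __init__(self, dim=512, heads=8, num_classes=28):
        super().__init__()
        # Temporal attention: audio segments attend over the frame sequence.
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Spatial attention: score each spatial location within a frame.
        self.spatial_score = nn.Linear(dim, 1)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio, visual):
        # audio:  (B, T, dim)       per-segment audio features (assumed shape)
        # visual: (B, T, HW, dim)   per-frame spatial visual features (assumed shape)
        # "Where": softmax weights over the HW spatial locations of each frame.
        w_space = self.spatial_score(visual).softmax(dim=2)        # (B, T, HW, 1)
        v_frames = (w_space * visual).sum(dim=2)                   # (B, T, dim)
        # "When": audio queries attend over the pooled frame features.
        v_ctx, w_time = self.temporal_attn(audio, v_frames, v_frames)
        logits = self.classifier(torch.cat([audio, v_ctx], dim=-1).mean(dim=1))
        # The attention weights double as intrinsic explanations:
        # w_space localizes visual evidence, w_time localizes sounds in time.
        return logits, w_space.squeeze(-1), w_time

In such a design, the explanation is a by-product of the forward pass rather than a post-hoc attribution, which is what makes perturbation tests and pointing games natural evaluations of it.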

Updated: 2021-05-05