A First Look: Towards Explainable TextVQA Models via Visual and Textual Explanations
arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2021-04-29, DOI: arxiv-2105.02626
Varun Nagaraj Rao, Xingjian Zhen, Karen Hovsepian, Mingwei Shen

Explainable deep learning models are advantageous in many situations. Prior work mostly provides unimodal explanations through post-hoc approaches that are not part of the original system design. These explanation mechanisms also ignore useful textual information present in images. In this paper, we propose MTXNet, an end-to-end trainable multimodal architecture that generates multimodal explanations focused on the text in the image. We curate a novel dataset, TextVQA-X, containing ground-truth visual and multi-reference textual explanations that can be leveraged during both training and evaluation. We then quantitatively show that training with multimodal explanations complements model performance and surpasses unimodal baselines by up to 7% in CIDEr score and 2% in IoU. More importantly, we demonstrate that the multimodal explanations are consistent with human interpretations, help justify the model's decisions, and provide useful insights for diagnosing incorrect predictions. Finally, we describe a real-world e-commerce application that uses the generated multimodal explanations.
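The abstract reports gains of up to 2% in IoU on the visual explanations. As an illustrative sketch only (the paper's exact evaluation protocol is not reproduced here), the snippet below shows how IoU is commonly computed between a predicted and a ground-truth explanation mask; the function name `mask_iou`, the binary-threshold convention, and the toy arrays are assumptions made for this example.

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray, threshold: float = 0.5) -> float:
    """Intersection-over-Union between a predicted and a ground-truth
    visual-explanation mask (both HxW arrays; pixels above `threshold`
    count as attended)."""
    pred = pred_mask > threshold
    gt = gt_mask > threshold
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / union if union > 0 else 1.0  # two empty masks agree trivially

# Toy example: 4x4 attention maps
pred = np.zeros((4, 4)); pred[1:3, 1:3] = 1.0   # predicted region: 4 pixels
gt = np.zeros((4, 4)); gt[1:4, 1:4] = 1.0       # ground-truth region: 9 pixels
print(f"IoU = {mask_iou(pred, gt):.2f}")        # intersection 4 / union 9 ≈ 0.44
```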

Updated: 2021-05-07