Explain and improve: LRP-inference fine-tuning for image captioning models
Information Fusion (IF 18.6), Pub Date: 2021-07-31, DOI: 10.1016/j.inffus.2021.07.008
Jiamei Sun, Sebastian Lapuschkin, Wojciech Samek, Alexander Binder

This paper analyzes the predictions of image captioning models with attention mechanisms beyond visualizing the attention itself. We develop variants of layer-wise relevance propagation (LRP) and gradient-based explanation methods, tailored to image captioning models with attention mechanisms. We systematically compare the interpretability of attention heatmaps against the explanations provided by methods such as LRP, Grad-CAM, and Guided Grad-CAM. We show that explanation methods provide, for each word in the predicted caption, both pixel-wise image explanations (supporting and opposing pixels of the input image) and linguistic explanations (supporting and opposing words of the preceding sequence). We demonstrate with extensive experiments that explanation methods (1) can reveal additional evidence, beyond attention, that the model uses to make decisions; (2) correlate with object locations with high precision; (3) help to “debug” the model, e.g. by analyzing the reasons for hallucinated object words. Building on these observed properties of explanations, we further design an LRP-inference fine-tuning strategy that reduces object hallucination in image captioning models while maintaining sentence fluency. We conduct experiments with two widely used attention mechanisms: the adaptive attention mechanism computed with additive attention, and the multi-head attention mechanism computed with the scaled dot product.
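
As a rough illustration of the components named above, here is a minimal NumPy sketch of the two attention score functions and of the LRP epsilon rule that redistributes a layer's output relevance to its inputs. All shapes and function names are hypothetical; this shows the generic techniques only, not the paper's implementation.

```python
# Minimal NumPy sketch of the building blocks named in the abstract.
# Shapes and names are hypothetical, not the paper's implementation.
import numpy as np

def additive_attention(h, s, W1, W2, v):
    """Additive (Bahdanau-style) attention: softmax over
    v^T tanh(h_i W1 + s W2) for image regions h_i and decoder state s."""
    scores = np.tanh(h @ W1 + s @ W2) @ v              # (num_regions,)
    w = np.exp(scores - scores.max())
    return w / w.sum()                                 # attention weights

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, the per-head operation
    of multi-head attention."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def lrp_epsilon_linear(a, W, R_out, eps=1e-6):
    """LRP epsilon rule for a linear layer z = a @ W: the output
    relevance R_out is redistributed to the inputs in proportion to
    each contribution a_j * W[j, k], yielding signed relevances that
    (approximately) conserve sum(R_in) == sum(R_out)."""
    z = a @ W
    stab = z + eps * np.where(z >= 0, 1.0, -1.0)       # avoid division by 0
    return a * (W @ (R_out / stab))                    # R_in, shape of a
```

Attention weights only indicate where the model looked; propagating relevance backwards with rules like the epsilon rule yields the signed pixel-wise and linguistic evidence that the abstract contrasts against attention heatmaps.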




Updated: 2021-08-01