Transformer with Sparse Self-Attention Mechanism for Image Captioning
Electronics Letters (IF 1.1), Pub Date: 2020-07-01, DOI: 10.1049/el.2020.0635
Duofeng Wang, Haifeng Hu, Dihu Chen

Recently, the transformer has been applied to image captioning models, in which a convolutional neural network together with the transformer encoder acts as the image encoder, and the transformer decoder acts as the decoder. However, the transformer may suffer from interference from non-critical objects in a scene and have difficulty fully capturing image information, owing to the dense nature of its self-attention mechanism. In this Letter, to address this issue, the authors propose a novel transformer model with decreasing attention gates and an attention fusion module. Specifically, they first use attention gates to force the transformer to overcome the interference of non-critical objects and capture object information more efficiently, by truncating all attention weights smaller than a gate threshold. Secondly, by inheriting the attention matrix from the previous layer at each network layer, the attention fusion module enables each layer to attend to other objects without losing the most critical ones. The method is evaluated on the benchmark Microsoft COCO dataset and achieves better performance than state-of-the-art methods.
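
The abstract describes two mechanisms: an attention gate that truncates attention weights below a threshold to sparsify self-attention, and a fusion module that mixes each layer's attention matrix with the one inherited from the previous layer. The following is a minimal PyTorch sketch of these two ideas only; the fixed `threshold`, the `fusion_alpha` mixing weight, and the layer interface are illustrative assumptions, since the abstract does not give the decreasing threshold schedule or the exact fusion rule.

```python
# Sketch of gated sparse self-attention with attention fusion, assuming a
# fixed per-layer gate threshold and a convex-combination fusion rule.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, threshold=0.01, fusion_alpha=0.5):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.threshold = threshold        # gate: weights below this are truncated to 0
        self.fusion_alpha = fusion_alpha  # fusion: mix ratio with previous layer's attention

    def forward(self, x, prev_attn=None):
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, heads, N, d_head)
        q, k, v = (t.view(B, N, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))

        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)

        # Attention gate: truncate weights smaller than the threshold, then renormalise.
        attn = attn * (attn >= self.threshold).float()
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)

        # Attention fusion: inherit the previous layer's attention matrix.
        if prev_attn is not None:
            attn = self.fusion_alpha * attn + (1 - self.fusion_alpha) * prev_attn

        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out), attn  # return attn so the next layer can fuse it


# Usage sketch: stack layers, passing each layer's attention matrix to the next.
x = torch.randn(2, 49, 512)   # e.g. a 7x7 CNN feature grid flattened to 49 regions
layer1 = GatedSelfAttention(512, 8)
layer2 = GatedSelfAttention(512, 8)
h, a1 = layer1(x)
h, a2 = layer2(h, prev_attn=a1)
```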
