Adaptive Attention-based High-level Semantic Introduction for Image Caption
ACM Transactions on Multimedia Computing, Communications, and Applications ( IF 5.1 ) Pub Date : 2020-12-17 , DOI: 10.1145/3409388
Xiaoxiao Liu 1 , Qingyang Xu 1

There have been several attempts to integrate a spatial visual attention mechanism into an image caption model and to introduce semantic concepts as guidance for caption generation. High-level semantic information captures the abstractness and generality of an image, which is beneficial to model performance. However, this high-level information is usually a static representation that does not account for salient elements, so it contains semantics that are redundant for caption generation. Therefore, in this article a semantic attention mechanism is applied to the high-level information in place of the conventional static representation. Additionally, the generation of visual words and non-visual words can be separated: an adaptive attention mechanism switches the guidance for caption generation between the new fused information (image features combined with high-level semantics) and a language model. Visual words are thus generated from the image features and high-level semantic information, while non-visual words are predicted by the language model. Semantic attention, adaptive attention, and the previously generated words are fused into a dedicated attention module applied to the input and output of a long short-term memory (LSTM) network. A caption can then be generated as a concise sentence that accurately captures the rich content of the image. Experimental results show that the proposed model achieves promising scores on the evaluation metrics and produces logical, rich descriptions.
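The two mechanisms described above can be illustrated schematically: semantic attention weights a set of high-level concept vectors by their relevance to the current decoder state, and the adaptive mechanism gates between the visual/semantic context and a language-model "sentinel" vector when predicting non-visual words. The sketch below is a minimal NumPy illustration of this idea, not the authors' implementation; all names (`semantic_attention`, `adaptive_context`, `w_g`, the dot-product scoring) are assumptions for the sake of the example.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def semantic_attention(concept_vecs, h):
    """Attend over high-level concept vectors given the LSTM hidden state h.

    concept_vecs: (num_concepts, dim) matrix of semantic concept embeddings
    h:            (dim,) current decoder hidden state
    Returns the attended semantic context vector, (dim,).
    """
    scores = concept_vecs @ h          # relevance of each concept to h
    alpha = softmax(scores)            # attention weights, sum to 1
    return alpha @ concept_vecs        # weighted combination of concepts

def adaptive_context(visual_ctx, sentinel, h, w_g):
    """Gate between the fused visual/semantic context and the language-model
    sentinel, so non-visual words can rely on the sentinel instead of the image.

    beta close to 1 -> trust the language model; close to 0 -> trust the image.
    w_g is a (hypothetical) learned gating weight vector.
    """
    beta = sigmoid(w_g @ h)
    return beta * sentinel + (1.0 - beta) * visual_ctx

# Toy usage with random vectors of dimension 4
rng = np.random.default_rng(0)
h = rng.standard_normal(4)
concepts = rng.standard_normal((3, 4))   # three high-level concepts
sem_ctx = semantic_attention(concepts, h)
ctx = adaptive_context(sem_ctx, sentinel=rng.standard_normal(4),
                       h=h, w_g=rng.standard_normal(4))
print(ctx.shape)  # (4,)
```

In the full model this gated context would be fused with the previous word embedding and fed to the LSTM at each decoding step; here the gate and attention weights are random rather than learned.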

Updated: 2020-12-17