Parallel-fusion LSTM with synchronous semantic and visual information for image captioning
Journal of Visual Communication and Image Representation (IF 2.6) Pub Date: 2021-02-09, DOI: 10.1016/j.jvcir.2021.103044
Jing Zhang, Kangkang Li, Zhe Wang

To combine dynamic semantic and visual information synchronously in the decoder of an image captioning model, we propose a novel parallel-fusion LSTM (pLSTM) structure. Two parallel LSTMs, one fed with image attributes and the other with visual information, are fused through their hidden states at every time step, so that attribute and visual information complement or reinforce each other to generate more accurate captions. Depending on how semantic information from the attribute LSTM is integrated into the visual LSTM, we propose two models: pLSTM with attention (pLSTM-A) and pLSTM with guiding (pLSTM-G). pLSTM-A automatically captures the crucial semantic and visual information for generating captions, while pLSTM-G uses synchronous semantic information to directly adjust the hidden state of the visual LSTM toward the critical region. To verify the effectiveness of the proposed pLSTM, we conduct a series of experiments on the MSCOCO and Flickr30K datasets; the results show that it outperforms some state-of-the-art image captioning methods.
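
The idea of fusing two parallel LSTMs through their hidden states at every time step can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the fusion operator, the inputs to each LSTM, or how the fused state is fed back, so the element-wise average, the concatenated inputs, and all names here (`LSTMCell`, `parallel_fusion_decode`, `attr_vec`, `vis_vec`) are assumptions chosen only for illustration. The attention (pLSTM-A) and guiding (pLSTM-G) variants are not modeled.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class LSTMCell:
    """Plain LSTM cell with one weight matrix per gate over [x; h]."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # Gates: input (i), forget (f), output (o), candidate (g).
        self.W = {k: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                      for _ in range(hidden_size)] for k in "ifog"}
        self.b = {k: [0.0] * hidden_size for k in "ifog"}

    def step(self, x, h, c):
        z = x + h  # list concatenation [x; h]
        pre = {k: [a + b for a, b in zip(matvec(self.W[k], z), self.b[k])]
               for k in "ifog"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        cand = [math.tanh(v) for v in pre["g"]]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, cand)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def parallel_fusion_decode(attr_vec, vis_vec, word_embs, attr_lstm, vis_lstm):
    """Run an attribute LSTM and a visual LSTM in parallel and fuse their
    hidden states at every time step.

    ASSUMPTION: fusion is an element-wise average, and the fused state
    replaces both hidden states before the next step. The paper may use a
    different operator and feedback path.
    """
    n = attr_lstm.hidden_size
    h_a, c_a = [0.0] * n, [0.0] * n
    h_v, c_v = [0.0] * n, [0.0] * n
    fused_states = []
    for w in word_embs:
        # Each LSTM sees the current word plus its own modality vector.
        h_a, c_a = attr_lstm.step(w + attr_vec, h_a, c_a)
        h_v, c_v = vis_lstm.step(w + vis_vec, h_v, c_v)
        fused = [0.5 * (a + v) for a, v in zip(h_a, h_v)]
        fused_states.append(fused)
        # Share the fused state with both branches (assumed feedback path).
        h_a, h_v = list(fused), list(fused)
    return fused_states


# Toy usage with made-up dimensions and random inputs.
rng = random.Random(0)
emb_dim, attr_dim, vis_dim, hidden = 3, 5, 6, 4
attr_lstm = LSTMCell(emb_dim + attr_dim, hidden, rng)
vis_lstm = LSTMCell(emb_dim + vis_dim, hidden, rng)
attr_vec = [rng.random() for _ in range(attr_dim)]  # e.g. attribute scores
vis_vec = [rng.random() for _ in range(vis_dim)]    # e.g. pooled CNN features
word_embs = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(3)]
states = parallel_fusion_decode(attr_vec, vis_vec, word_embs, attr_lstm, vis_lstm)
```

In a full captioning model each fused state would be projected onto the vocabulary to predict the next word. That projection is omitted here to keep the sketch focused on the per-step fusion.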




Updated: 2021-02-16