
Image Captioning with Dense Fusion Connection and Improved Stacked Attention Module


Abstract

In existing image captioning methods, masked convolution is commonly used to generate the language description, and the traditional residual network (ResNet) approach applied to masked convolution brings about the vanishing gradient problem. To address this issue, we propose a new image captioning framework that combines a dense fusion connection (DFC) with an improved stacked attention module. DFC uses the densely connected convolutional network (DenseNet) architecture to connect each layer to every other layer in a feed-forward fashion, and then adopts the ResNet approach of combining features through summation. The improved stacked attention module captures more fine-grained visual information that is highly relevant to word prediction. Finally, we apply a Transformer to the image encoder to obtain a sufficiently attended image representation. Experimental results on the MS-COCO dataset demonstrate that the proposed model increases the CIDEr score from 91.2% to 106.1%, outperforming comparable models and verifying its effectiveness.
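
For illustration only, the sketch below shows one way the dense fusion connection described in the abstract could be realized: each layer receives the concatenation of all preceding feature maps (DenseNet-style dense connectivity), and the fused result is combined with the block input through summation (ResNet-style shortcut). This is a minimal PyTorch sketch under our own assumptions, not the authors' implementation; the causal masking of the decoder convolutions is omitted, and the layer sizes (growth_rate, num_layers) are hypothetical placeholders.

```python
# Hypothetical sketch of a dense fusion connection (DFC) block:
# DenseNet-style concatenation of all earlier layer outputs,
# followed by a ResNet-style combination through summation.
import torch
import torch.nn as nn


class DenseFusionBlock(nn.Module):
    def __init__(self, in_channels: int, growth_rate: int = 64, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            # Each layer sees the concatenation of all previous feature maps
            # (dense connectivity, as in DenseNets). A real decoder would use
            # masked (causal) convolutions here; plain Conv1d is a simplification.
            self.layers.append(nn.Sequential(
                nn.Conv1d(channels, growth_rate, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ))
            channels += growth_rate
        # Project the densely fused features back to the input width so a
        # ResNet-style shortcut can be added through summation.
        self.project = nn.Conv1d(channels, in_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        fused = self.project(torch.cat(features, dim=1))
        return x + fused  # combine features through summation (ResNet-style)


# Usage: a batch of 8 sequences, 512 feature channels, 20 word positions.
block = DenseFusionBlock(in_channels=512)
out = block(torch.randn(8, 512, 20))  # shape: (8, 512, 20)
```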




Acknowledgements

This work was supported by the Natural Science Foundation of Liaoning Province (No. 2020-MS-080), the Fundamental Research Funds for the Central Universities (No. N2005032), and the Key Projects of the Natural Science Foundation of Liaoning Province (No. 2017012074-301).

Author information


Corresponding author

Correspondence to Xiangde Zhang.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhu, H., Wang, R. & Zhang, X. Image Captioning with Dense Fusion Connection and Improved Stacked Attention Module. Neural Process Lett 53, 1101–1118 (2021). https://doi.org/10.1007/s11063-021-10431-y

