
Image Captioning using Reinforcement Learning with BLUDEr Optimization

  • MATHEMATICAL THEORY OF IMAGES AND SIGNALS REPRESENTING, PROCESSING, ANALYSIS, RECOGNITION, AND UNDERSTANDING

Abstract

Image captioning is a rapidly growing area of research. It is a challenging task owing to the complexity of natural language generation and the difficulty of extracting features from a diverse collection of images. Many models have been proposed to tackle the problem, most notably encoder-decoder (sequential CNN-RNN) systems, which have achieved strong results. More recently, reinforcement learning has emerged as an alternative approach and has surpassed many state-of-the-art paradigms. We propose a new reward, the BLUDEr metric, defined as a linear combination of the non-differentiable metrics BLEU and CIDEr, and we directly optimize this metric for our model on natural language generation tasks. In our experiments, we use the Flickr30k and Flickr8k datasets, two of the benchmark datasets for image captioning. Our model achieves state-of-the-art results on both datasets when compared with other models.
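The snippet below is a minimal sketch of how a BLUDEr-style reward could be computed: a weighted sum of sentence-level BLEU (via NLTK) and CIDEr (via the COCO caption evaluation toolkit). The mixing weight `alpha`, the function name `bluder_reward`, and the single-image CIDEr call are illustrative assumptions, not the authors' released code, which defines the exact form of the combination.

```python
# Sketch of a BLUDEr-style reward: a linear combination of the
# non-differentiable BLEU and CIDEr metrics. All names and the weight
# `alpha` are illustrative assumptions.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from pycocoevalcap.cider.cider import Cider  # COCO caption evaluation toolkit

_smooth = SmoothingFunction().method1
_cider = Cider()


def bluder_reward(candidate, references, alpha=0.5):
    """Return alpha * BLEU + (1 - alpha) * CIDEr for one sampled caption.

    candidate  -- caption string sampled from the captioning model
    references -- list of ground-truth caption strings for the image
    alpha      -- assumed mixing weight between the two metrics
    """
    # Sentence-level BLEU-4 with smoothing; plain whitespace tokenization
    # is used here for brevity.
    bleu = sentence_bleu(
        [ref.split() for ref in references],
        candidate.split(),
        smoothing_function=_smooth,
    )
    # CIDEr expects {image_id: [caption, ...]} dicts. Scoring a single image
    # is for illustration only: in practice CIDEr's IDF statistics are
    # computed over the whole evaluation set.
    cider, _ = _cider.compute_score({0: references}, {0: [candidate]})
    return alpha * bleu + (1.0 - alpha) * cider
```

Because the combined score is non-differentiable, it would enter training through a policy-gradient estimator: a caption sampled from the model receives this reward, and the reward minus a baseline weights the gradient of the caption's log-probability, in the usual REINFORCE-style formulation.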



ACKNOWLEDGMENTS

This work was carried out at the ‘Centre for Data Sciences and Applied Machine Learning’ (CDSAML) of PES University.

Author information


Correspondence to P. R. Devi, V. Thrivikraman, D. Kashyap or S. S. Shylaja.

Ethics declarations

The authors declare that they have no conflicts of interest.

This article does not contain any studies involving human participants performed by any of the authors.

Additional information

P. Rama Devi received her Master's degree in Computer Science from Jawaharlal Nehru Technological University, Hyderabad, India, in 2009. She is currently working on her PhD at Visvesvaraya Technological University, Belgaum, India. She worked as an Assistant Professor at Sree Vidyanikethan Engineering College, Tirupati, India, from 2005 to 2010, and since 2010 she has been working at PES University, Bangalore, India, formerly known as PESIT. She has published two conference and two journal papers. She is a permanent member of IEEE. Her areas of interest include Image Processing and Computer Vision.

Thrivikraman V is currently pursuing his undergraduate degree in Computer Science and Engineering at PES University, Bangalore. He is a research scholar at the Centre for Data Sciences and Applied Machine Learning (CDSAML) at PES University. He is keenly interested in developing intelligent systems to address pressing challenges in the healthcare domain. His research areas include Image Processing, Computer Vision, and Machine Learning.

Dhruva Kashyap is currently pursuing his undergraduate degree in Computer Science and Engineering at PES University, Bangalore. He is a research scholar at the Centre for Data Sciences and Applied Machine Learning (CDSAML) at PES University. His research interests include Image Processing, Computer Vision, and Machine Learning.

Dr. Shylaja S S is the Chairperson of the Department of Computer Science at PES University, Bangalore. She obtained her Master's degree from Sri Jayachamarajendra College of Engineering, Mysore University, in 1993 and her PhD in the domain of Face Recognition from Visvesvaraya Technological University (VTU), Belgaum, India. She has 30 years of teaching experience and 15 years of research experience. She has several journal articles and national and international conference publications to her credit. She is a member of the Board of Studies of several organizations, a reviewer for conferences, and a member of many technical committees. Dr. Shylaja heads the Centre for Data Sciences and Applied Machine Learning (CDSAML) at PES University, facilitating research and internship opportunities for students and faculty members. Her research areas include Image Processing, Computer Vision, Natural Language Processing, and Machine Learning.


About this article


Cite this article

Devi, P.R., Thrivikraman, V., Kashyap, D. et al. Image Captioning using Reinforcement Learning with BLUDEr Optimization. Pattern Recognit. Image Anal. 30, 607–613 (2020). https://doi.org/10.1134/S1054661820040094

