Abstract
Image captioning is a rapidly growing area of research. It is a challenging task owing to the complexity of natural language generation and the difficulty of extracting features from a diverse collection of images. Many models have been proposed for the problem, most notably encoder-decoder (sequential CNN-RNN) systems, which have achieved strong results. Recently, reinforcement learning has emerged as a successful approach, surpassing many state-of-the-art paradigms. We propose a new reward function, the BLUDEr metric, defined as a linear combination of the non-differentiable metrics BLEU and CIDEr, and directly optimize it for natural language generation. In our experiments, we use the Flickr30k and Flickr8k datasets, two of the standard benchmarks for image captioning systems. Our model achieves state-of-the-art results on both datasets.
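As a rough illustration of the reward described above, a BLUDEr-style score can be sketched as a weighted sum of a BLEU score and a CIDEr score. This is a minimal, assumption-laden sketch: the weights `alpha` and `beta`, the single-reference BLEU helper, and treating the CIDEr score as a precomputed input are all illustrative choices, not the paper's implementation.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped
    n-gram precisions, times a brevity penalty."""
    if not candidate:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_mean)

def bluder(candidate, reference, cider_score, alpha=0.5, beta=0.5):
    """Hypothetical BLUDEr-style reward: linear combination of BLEU and
    CIDEr. alpha/beta are placeholder weights, and cider_score is assumed
    to be computed by a separate CIDEr scorer."""
    return alpha * bleu(candidate, reference) + beta * cider_score
```

In a reinforcement-learning setup such as self-critical sequence training, a scalar reward like this would be computed for each sampled caption and used to weight the policy gradient; that training loop is omitted here.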
ACKNOWLEDGMENTS
This work was carried out at the ‘Centre for Data Sciences and Applied Machine Learning’ (CDSAML) of PES University.
Ethics declarations
The authors declare that they have no conflicts of interest.
This article does not contain any studies involving human participants performed by any of the authors.
Additional information
P Rama Devi received her Master's degree in Computer Science from Jawaharlal Nehru Technological University, Hyderabad, India, in 2009. She is currently working on her PhD at Visvesvaraya Technological University, Belgaum, India. She worked as an Assistant Professor at Sree Vidyanikethan Engineering College, Tirupati, India, from 2005 to 2010, and since 2010 she has been with PES University, Bangalore, India, formerly known as PESIT. She has published two conference and two journal papers and holds a permanent IEEE membership. Her areas of interest include Image Processing and Computer Vision.
Thrivikraman V is currently pursuing his undergraduate degree in Computer Science and Engineering at PES University, Bangalore. He is a research scholar at the Centre for Data Sciences and Applied Machine Learning (CDSAML) at PES University. He is keenly interested in developing intelligent systems to address pressing challenges in the healthcare domain. His research areas include Image Processing, Computer Vision, and Machine Learning.
Dhruva Kashyap is currently pursuing his undergraduate degree in Computer Science and Engineering at PES University, Bangalore. He is a research scholar at the Centre for Data Sciences and Applied Machine Learning (CDSAML) at PES University. His research interests include Image Processing, Computer Vision, and Machine Learning.
Dr. Shylaja S S is the Chairperson of the Department of Computer Science at PES University, Bangalore. She obtained her Master's degree from Sri Jayachamarajendra College of Engineering, Mysore University, in 1993 and her PhD in the domain of Face Recognition from Visvesvaraya Technological University (VTU), Belgaum, India. She has 30 years of teaching experience and 15 years of research experience, with several journal articles and national and international conference publications to her credit. She is a member of the Board of Studies of several organizations, a reviewer for conferences, and a member of many technical committees. Dr. Shylaja heads the Centre for Data Sciences and Applied Machine Learning (CDSAML) at PES University, facilitating research and internship opportunities for students and faculty members. Her research areas include Image Processing, Computer Vision, Natural Language Processing, and Machine Learning.
Cite this article
Devi, P.R., Thrivikraman, V., Kashyap, D. et al. Image Captioning using Reinforcement Learning with BLUDEr Optimization. Pattern Recognit. Image Anal. 30, 607–613 (2020). https://doi.org/10.1134/S1054661820040094