
Visual question answering: a state-of-the-art review


Abstract

Visual question answering (VQA) is a task that has received immense attention from two major research communities: computer vision and natural language processing. Recently, it has been widely accepted as an AI-complete task that can serve as an alternative to the visual Turing test. In its most common form, it is a challenging multi-modal task in which a computer must provide the correct answer to a natural language question asked about an input image. It has attracted many deep learning researchers following their remarkable achievements in text, speech and vision technologies. This review extensively and critically examines the current state of VQA research in terms of step-by-step solution methodologies, datasets and evaluation metrics. Finally, the paper discusses future research directions for each of these aspects of VQA.
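The common form of the task described above maps an (image, question) pair to an answer, typically framed as classification over a fixed answer vocabulary after fusing the two modalities. The following is a minimal, untrained sketch of that pipeline; the encoders, feature sizes, fusion choice (element-wise product) and the toy vocabularies are illustrative placeholders, not any specific method surveyed in the review.

```python
# Illustrative VQA pipeline sketch: encode image, encode question,
# fuse both modalities, classify over a small closed answer set.
# All weights are random (untrained); every name here is hypothetical.
import numpy as np

rng = np.random.default_rng(0)

ANSWERS = ["yes", "no", "red", "two"]                 # toy answer vocabulary
QUESTION_VOCAB = {"is": 0, "the": 1, "ball": 2, "red": 3, "how": 4, "many": 5}

def encode_image(image: np.ndarray, dim: int = 8) -> np.ndarray:
    """Stand-in for a CNN encoder: project flattened pixels to a feature vector."""
    w = rng.standard_normal((dim, image.size))
    return np.tanh(w @ image.ravel())

def encode_question(question: str, dim: int = 8) -> np.ndarray:
    """Stand-in for a word-embedding/RNN encoder: averaged word embeddings."""
    emb = rng.standard_normal((len(QUESTION_VOCAB), dim))
    idx = [QUESTION_VOCAB[w] for w in question.lower().split() if w in QUESTION_VOCAB]
    return emb[idx].mean(axis=0) if idx else np.zeros(dim)

def answer(image: np.ndarray, question: str) -> str:
    """Fuse modalities by element-wise product, then score each candidate answer."""
    fused = encode_image(image) * encode_question(question)
    w_out = rng.standard_normal((len(ANSWERS), fused.size))
    scores = w_out @ fused                            # untrained linear classifier
    return ANSWERS[int(np.argmax(scores))]

image = rng.random((4, 4, 3))                         # fake 4x4 RGB image
print(answer(image, "is the ball red"))               # prints one of ANSWERS
```

Real systems replace each stand-in with a trained component (e.g. a pretrained CNN, a recurrent or transformer question encoder, and learned fusion), which is exactly the design space the review's methodology sections cover.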




Author information


Corresponding author

Correspondence to Sruthy Manmadhan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Manmadhan, S., Kovoor, B.C. Visual question answering: a state-of-the-art review. Artif Intell Rev 53, 5705–5745 (2020). https://doi.org/10.1007/s10462-020-09832-7

