
Pattern Recognition Letters

Volume 150, October 2021, Pages 57-75

Visual question answering in the medical domain based on deep learning approaches: A comprehensive study

https://doi.org/10.1016/j.patrec.2021.07.002

Highlights

  • Visual Question Answering in the Medical Domain (VQA-Med) is a new and very challenging task.

  • We experiment with various Deep Learning approaches on ImageCLEF’s unique VQA-Med dataset.

  • Our hierarchical model consists of several sub-models based on pre-trained CNNs and achieves 60.8% accuracy and a 63.4 BLEU score.

  • Our model leverages techniques like Data Augmentation, Global Average Pooling (GAP) and Ensembling.

  • We experiment with several approaches based on Multi-Task Learning, GANs and Seq2Seq models, but they yield no improvements.

Abstract

Visual Question Answering (VQA) in the medical domain has attracted increasing attention from research communities in the last few years due to its various applications. This paper investigates several deep learning approaches to building a medical VQA system based on ImageCLEF’s VQA-Med dataset, which consists of about 4K images with about 15K question-answer pairs. Due to the wide variety of images and questions included in this dataset, the proposed model is a hierarchical one consisting of several sub-models, each tailored to handle a certain category of questions. For that, a dedicated model is built to classify the questions into four categories, where each category is handled by a separate sub-model. At their core, all of these models consist of pre-trained Convolutional Neural Networks (CNNs). In order to obtain the best results, extensive experiments are performed and various techniques are employed, including Data Augmentation (DA), Multi-Task Learning (MTL), Global Average Pooling (GAP), Ensembling, and Sequence to Sequence (Seq2Seq) models. Overall, the final model achieves 60.8% accuracy and a 63.4 BLEU score, which are competitive with the state-of-the-art results despite using simpler and less demanding sub-models.

Introduction

With the advances in the computer vision (CV) and natural language processing (NLP) fields, a new and challenging task, Visual Question Answering (VQA), has been proposed, attracting the attention of both research communities. VQA is the task of answering a specific question about a given image. Thus, there is a need to combine CV techniques that provide an understanding of the image’s content with NLP techniques that provide an understanding of the question and the ability to produce the answer. The difficulty of the problem depends on the expected answer types, whether they are yes/no, multiple-choice, or open-ended.

Recently, VQA has been applied to specific domains such as the medical domain. VQA in the medical domain has many applications, such as supporting clinicians in decision making to enhance their confidence, helping medical students with image interpretation, automating disease diagnosis, and answering patients’ questions that do not require a dedicated visit to a doctor.

Medical VQA poses its own set of issues and challenges that differ from the ones faced in general-domain VQA. Some of these challenges are related to the processing of medical images and the difficulty of handling all kinds of images for different body parts and extracting regions of interest that vary greatly across medical cases and ailments. Another set of challenges is related to understanding the questions and the ability to process highly technical medical terms as well as non-medical terms used by common users. The resources required to address all these challenges can be massive, and there are many restrictions related to using them and integrating them into a single model.

The year 2018 witnessed the inauguration of a special challenge for VQA in the medical domain under the name of the VQA-Med challenge, which was organized by the reputable ImageCLEF conference [1]. The best Bilingual Evaluation Understudy (BLEU) [2] score achieved by the five participating teams was 16.2 [3], which is a very low score. However, this is expected given the difficulty of the task. Moreover, the work on this problem is still in its early stages, and, with time, it is expected to improve through the creation of more reliable datasets and models. The problem with the 2018 dataset is that it encompasses several medical concepts within a rather small dataset. In 2019, the second instalment of the VQA-Med challenge [4] was launched with a dataset that is even more comprehensive and diverse. It consists of four question/data categories, aiming to provide a complete medical VQA system. This is the dataset we consider in this work.

This work aims to gauge the effectiveness of different deep learning techniques in solving the task at hand. Therefore, an accurate and efficient medical VQA system is built consisting of several sub-models, where each sub-model is specialized in answering a specific category of questions. These sub-models vary in the deep learning techniques used, and each is the result of extensive experimentation aimed at reaching the best performance for its respective category. These techniques include pre-trained CNN models (with and without auxiliary information), Data Augmentation (DA), Global Average Pooling (GAP), and ensembling; a minimal sketch of one such sub-model is given after the contributions list below. All of these techniques improve the effectiveness of the overall model in solving the VQA-Med 2019 challenge. On the other hand, we also perform experiments using techniques that seem promising but result in no improvement, such as Sequence to Sequence (Seq2Seq) models (with the encoder-decoder architecture, image captioning models, and attention mechanisms), in addition to some advanced techniques such as Multi-Task Learning (MTL) and Generative Adversarial Networks (GAN).1 The contributions of this work are as follows.

  • We present a medical VQA model based on the VQA-Med 2019 challenge. This model is simpler than the state-of-the-art (SOTA) models while achieving competitive performance.

  • While building this model, we explore the use of many cutting-edge techniques, some of which are shown to be useful while others are not. This will help guide future research in this area.

  • We perform an extensive set of experiments on the presented model and its components and analyze their results. We investigate the model’s weaknesses and present justifications for some of the failure cases.
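The following is a minimal Keras sketch of one category sub-model of the kind described above: a frozen pre-trained VGG16 backbone followed by GAP and a small classification head, together with simple data augmentation. The input size, number of answer classes, and hyper-parameters are illustrative assumptions, not the paper’s exact settings.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 16  # hypothetical answer-set size for one category (assumption)

def build_category_submodel(num_classes=NUM_CLASSES):
    # Pre-trained CNN backbone with the classifier head removed and its weights frozen.
    backbone = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    backbone.trainable = False
    return models.Sequential([
        backbone,
        layers.GlobalAveragePooling2D(),          # GAP instead of flattening the feature maps
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_category_submodel()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Simple data augmentation of the training images (small rotations, shifts, flips).
augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=15, width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)

An ensemble can then be formed by training several such models (e.g. with different backbones or random seeds) and averaging their softmax outputs before taking the argmax.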

The rest of this paper is organized as follows: Section 2 presents the most relevant work. Section 3 presents a detailed analysis of the dataset, which we find useful in building our models. In Section 4, we present the proposed models that achieve the best performance for answering the questions of each category. These models include basic image classification models, image classification models with auxiliary information, DA models, GAP models, and ensemble models. Section 5 reports and analyzes the experimental results of the models proposed in Section 4. Section 6 summarizes this work and outlines possible directions for future work. Finally, the appendices contain more details about the dataset and the other proposed techniques that do not yield good performance, including MTL models, Seq2Seq models with their variations, and GAN-based models.

Section snippets

Related works

The general VQA challenge,2 which has been held every year since 2016, is based on a large dataset of real-world images with different question types such as yes/no questions, questions about quantities, etc. Different approaches have been applied to the task, and most solutions rely on deep learning techniques. These techniques combine the use of word embeddings with different recurrent neural networks (RNNs) for text embedding and feature extraction, and
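As an illustration of the general-domain pipeline sketched above, the following minimal Keras example encodes the question with word embeddings and an LSTM, projects pre-extracted CNN image features, and fuses the two modalities before classifying over a fixed answer set. The vocabulary size, dimensions, and element-wise fusion are illustrative assumptions rather than any specific published model.

from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, IMG_FEAT_DIM, NUM_ANSWERS = 5000, 20, 512, 1000  # assumptions

# Question branch: word embeddings followed by an LSTM encoder.
question_in = layers.Input(shape=(MAX_LEN,), name="question_tokens")
q = layers.Embedding(VOCAB_SIZE, 300)(question_in)
q = layers.LSTM(256)(q)

# Image branch: pre-extracted CNN features projected to the same dimension.
image_in = layers.Input(shape=(IMG_FEAT_DIM,), name="cnn_image_features")
v = layers.Dense(256, activation="relu")(image_in)

# Element-wise fusion of the two modalities, then classification over a fixed answer set.
fused = layers.Multiply()([q, v])
answer_out = layers.Dense(NUM_ANSWERS, activation="softmax")(fused)

vqa_model = models.Model([question_in, image_in], answer_out)
vqa_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])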

Dataset

The dataset used in VQA-Med 2019 was generated from the MedPix3 database. It consists of 3,200 medical images with 12,792 Question-Answer (QA) pairs as training data, 500 medical images with 2,000 QA pairs as validation data, and 500 medical images with 500 QA pairs as test data. The data is equally distributed over four categories: plane, organ, modality, and abnormality. Each image has a question in each of these four categories. The images
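A minimal sketch of loading such QA pairs is given below, assuming a pipe-separated "image_id|question|answer" layout per line; the file name, path, and delimiter are assumptions for illustration, not details taken from the paper.

from collections import Counter
from pathlib import Path

def load_qa_pairs(path):
    # Return a list of (image_id, question, answer) triples.
    triples = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        image_id, question, answer = line.split("|", maxsplit=2)
        triples.append((image_id, question.strip(), answer.strip().lower()))
    return triples

# Example: inspect the most frequent answers in the training split.
# train = load_qa_pairs("VQAMed2019/train_qa_pairs.txt")   # hypothetical path
# print(Counter(answer for _, _, answer in train).most_common(10))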

Methodology

Since the questions in this work belong to different categories, a dedicated model is created for each category. These models are combined into one overall model used for predicting answers. In order to use them correctly to answer a given question about a given image, the suitable sub-model must be selected based on the question words.

A model is built to classify the question category. It is a rule-based model that does not require training (i.e.,
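A minimal sketch of such a rule-based category classifier is shown below; the keyword rules are illustrative assumptions and not the paper’s exact rules.

def classify_question(question: str) -> str:
    # Route a question to one of the four categories using characteristic keywords.
    q = question.lower()
    if "plane" in q:
        return "plane"         # e.g. "in what plane is this image taken?"
    if "organ" in q or "part of the body" in q:
        return "organ"         # e.g. "what organ system is evaluated primarily?"
    if "modality" in q or "mri" in q or "ct" in q or "contrast" in q or "taken" in q:
        return "modality"      # e.g. "what imaging modality was used?"
    return "abnormality"       # remaining questions ask about the abnormality

# The predicted category selects the sub-model that answers the question.
# classify_question("In what plane is the image acquired?")  # -> "plane"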

Results and analysis

The experiments to evaluate the proposed models and their results are discussed in this section. First, the evaluation results for each of the proposed models on the validation data are reported in order to find the best model for each of the four categories. After that, the results for the testing data are reported. The evaluation metrics are accuracy and cumulative 4-gram BLEU score. For all models, experiments are conducted using different optimizers [40] (which are RMSprop, Adagrad,
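For reference, a cumulative 4-gram BLEU score of this kind can be computed with NLTK as in the following sketch; the challenge’s official evaluation script may differ in pre-processing (e.g. stop-word removal), so this is only illustrative.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4(reference_answer: str, predicted_answer: str) -> float:
    # Cumulative 4-gram BLEU between a single reference answer and a prediction.
    reference = [reference_answer.lower().split()]
    hypothesis = predicted_answer.lower().split()
    return sentence_bleu(reference, hypothesis,
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)

# bleu4("t2 weighted mri", "t2 weighted mri")  # exact match -> 1.0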

Conclusions and future work

In this paper, a medical VQA system is built based on the dataset provided by the ImageCLEF VQA-Med 2019 challenge. The system consists of several sub-models. A dedicated sub-model is built to classify the question category with 100% accuracy, and hence determine the appropriate model to answer it. The sub-models used for answering questions are based on the pre-trained VGG model. The overall accuracy of the best model is 60.8%, with a 63.4 BLEU score. The accuracies of the plane, organ, and modality models are very good

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

We would like to thank the Deanship of Research at the Jordan University of Science and Technology for supporting this work (Grant #20190180). We would also like to thank Dr. Asma’ Al-Mnayyis, a Radiologist from the College of Medicine at Yarmouk University, Jordan, for her help with the medical concepts related to the dataset.

References (49)

  • Q. Wu et al.

    Visual question answering: a survey of methods and datasets

    Comput. Vision Image Understanding

    (2017)
  • B. Ionescu

    ImageCLEF 2019: multimedia retrieval in medicine, lifelogging, security and nature

    Experimental IR Meets Multilinguality, Multimodality, and Interaction

    (2019)
  • K. Papineni et al.

BLEU: a method for automatic evaluation of machine translation

Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

    (2002)
  • S.A. Hasan et al.

    Overview of the ImageCLEF 2018 medical domain visual question answering task

    CLEF2018 Working Notes

    (2018)
  • A. Ben Abacha et al.

VQA-Med: overview of the medical visual question answering task at ImageCLEF 2019

    CLEF 2019 Working Notes

    (2019)
  • A.K. Gupta

    Survey of visual question answering: datasets and techniques

    arXiv preprint arXiv:1705.03865

    (2017)
  • K. Kafle et al.

    Answer-type prediction for visual question answering

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • K. Simonyan et al.

    Very deep convolutional networks for large-scale image recognition

    arXiv preprint arXiv:1409.1556

    (2014)
  • K. He et al.

    Deep residual learning for image recognition

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • T. Mikolov et al.

    Distributed representations of words and phrases and their compositionality

    Advances in Neural Information Processing Systems

    (2013)
  • J. Pennington et al.

GloVe: global vectors for word representation

    Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

    (2014)
  • M. Malinowski et al.

    Ask your neurons: A neural-based approach to answering questions about images

    Proceedings of the IEEE International Conference on Computer Vision

    (2015)
  • M. Ren et al.

    Image question answering: a visual semantic embedding model and a new dataset

    Proc. Advances in Neural Inf. Process. Syst

    (2015)
  • K.J. Shih et al.

    Where to look: Focus regions for visual question answering

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • Y. Zhu et al.

Visual7W: grounded question answering in images

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • Z. Yang et al.

    Stacked attention networks for image question answering

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • J. Lu et al.

    Hierarchical question-image co-attention for visual question answering

    Advances In Neural Information Processing Systems

    (2016)
  • Y. Peng et al.

UMass at ImageCLEF medical visual question answering (Med-VQA) 2018 task

    Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10–14, 2018.

    (2018)
  • Y. Zhou et al.

Employing Inception-ResNet-v2 and Bi-LSTM for medical domain visual question answering

    Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10–14, 2018.

    (2018)
  • B. Talafha et al.

JUST at VQA-Med: a VGG-Seq2Seq model

    Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10–14, 2018.

    (2018)
  • A.B. Abacha et al.

NLM at ImageCLEF 2018 visual question answering in the medical domain

    Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10–14, 2018.

    (2018)
  • I. Allaouzi et al.

    Deep neural networks and decision tree classifier for visual question answering in the medical domain

    Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10–14, 2018.

    (2018)
  • J. Devlin et al.

BERT: pre-training of deep bidirectional transformers for language understanding

    arXiv preprint arXiv:1810.04805

    (2018)
  • A. Vaswani et al.

    Attention is all you need

Advances in Neural Information Processing Systems

    (2017)