Introduction

COVID-19 is an infectious disease that had infected more than 4.5 million individuals worldwide as of May 14, 2020 [1]. The current tests for diagnosing this disease are mostly based on reverse transcription-polymerase chain reaction (RT-PCR). However, RT-PCR test kits are in huge shortage, and the test takes 4–6 h to return results, which is a long time compared with the rapid spreading rate of COVID-19. As a result, many infected cases cannot be detected promptly and continue to infect others unknowingly. Therefore, many efforts have been made to develop alternative testing methods. Computed tomography (CT) scanners are promising as a faster, more efficient and more accessible testing modality [2]. The diagnosis of COVID-19 is then based on the evaluation and assessment of CT images by a radiologist. However, this work is tedious and often subject to a high degree of inter-observer variability, which results in uncertainty. Thus, to overcome these limitations, an automatic, reliable and reproducible approach using advanced machine learning is required. Such a system can be used everywhere without the need for a highly trained radiologist.

Different machine learning methods are widely used in medical image processing. Among them, deep learning (DL) methods have emerged in recent years as the newest and most advanced approaches. DL methods, which rely on multi-layered neural networks, can extract and learn increasingly complex, task-related features directly from the data. Recent developments in neural network architecture design and training have enabled researchers to solve previously intractable learning tasks with DL methods. However, these models need a large amount of data to train well and to avoid overfitting. To address this problem, transfer learning methods have been introduced. Transfer learning takes a standard neural architecture along with its weights pre-trained on large datasets and then fine-tunes the weights on a target task that has limited training data [3]. In other words, the strategy is to learn a powerful deep network that extracts comprehensive visual features by pre-training on large datasets and then adapt the weights of this pre-trained network to the target task with its small-sized dataset. As a result, several studies in recent years have applied deep transfer learning to a wide range of medical image classification and recognition tasks with great success [4,5,6,7], especially in skin cancer [8], pneumonia detection [9], tumor classification [10], cardiovascular [11], ophthalmology [12], musculoskeletal [13] and histopathological images [14].

Analysis of medical images from chest X-ray and CT scans is very important for the early and accurate diagnosis of pneumonia and similar pulmonary diseases and assists effective treatment. Ke et al. [15] proposed hybrid neuro-heuristic and neuro-fuzzy methods for detecting small changes in the structure of lung tissue, and Chandra et al. [16] detected tuberculosis-related abnormalities using a hierarchical feature extraction method on chest X-ray images. Peláez et al. [17] proposed a methodology for the automated identification of interstitial lung abnormality patterns using an ensemble of deep convolutional neural networks, and Gupta et al. [18] identified chronic obstructive pulmonary disease and fibrosis in lung CT images by extracting relevant features, selecting them and classifying them with a machine learning classifier. Recently, there have been several studies on the automatic detection of COVID-19 from CT scans using DL methods. Xu et al. proposed multiple CNN models [19], Zheng et al. a 3D deep CNN (DeCoVNet) [20] and Wang et al. a modified Inception transfer learning model [21] to detect COVID-19 from chest CT scans. Also, Shan et al. employed the "VB-Net" neural network to segment COVID-19 infection regions [22], and Chen et al. used UNet++ for identification of COVID-19 from CT scans [23]. These studies suggest that DL-based detection of COVID-19 is feasible, but they were all evaluated in limited settings.

The novelty and contribution of this paper is an automatic methodology based on an ensemble deep transfer learning system with different pre-trained CNN architectures, evaluated on a publicly available dataset of CT images for the diagnosis of COVID-19. We design the optimal combination of the deep transfer learning outputs using an additional ensemble network.

Materials and methods

Database

We used the publicly available COVID19-CT dataset [24], which consists of 349 CT scans labeled as positive for COVID-19 from 216 patient cases and 397 COVID-19-negative CT scans, which are either normal or contain other types of lung disease, from 171 cases. The utility of this dataset has been confirmed by a senior radiologist at Tongji Hospital, Wuhan, China, who diagnosed and treated a large number of COVID-19 patients between January and April. More details about the dataset are described in [20]. The minimum, maximum and average numbers of CT scans per patient are 1.0, 16.0 and 1.6, respectively. The CT images have different sizes: the average, maximum and minimum heights are 491, 1853 and 153 pixels, and the average, maximum and minimum widths are 383, 1485 and 124 pixels, respectively. Figure 1 shows some examples of positive COVID-19 CT scans.

Fig. 1 Examples of positive COVID-19 CT scans

Deep learning and convolutional neural networks

Compared with conventional machine learning methods, deep learning algorithms have become particularly popular for disease diagnosis in medical imaging, with considerable performance improvements. One of the most popular DL methods in the field of medical imaging is the CNN [25]. It is a state-of-the-art DL methodology consisting of many stacked convolutional layers. A CNN structure comprises convolutional layers, max- or average-pooling layers, nonlinear layers, batch normalization, fully connected (FC) layers and finally a softmax layer. Pooling layers are frequently inserted between convolutional layers to improve translational invariance and reduce the size of the feature maps. Nonlinear layers (mostly the ReLU function) enable the network to solve nonlinear problems. Finally, FC layers prepare the extracted features to be classified by the softmax layer.
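As a toy illustration of this layer stack (not one of the networks used in this study; the input size and filter counts are arbitrary assumptions), a minimal Keras model combining convolution, batch normalization, ReLU, pooling, FC and softmax layers might look as follows:

```python
from keras.models import Sequential
from keras.layers import (Conv2D, BatchNormalization, Activation,
                          MaxPooling2D, Flatten, Dense)

# Minimal CNN: two conv blocks (conv -> batch norm -> ReLU -> max pooling),
# followed by fully connected layers and a 2-class softmax output.
model = Sequential([
    Conv2D(32, (3, 3), padding="same", input_shape=(224, 224, 3)),
    BatchNormalization(),
    Activation("relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), padding="same"),
    BatchNormalization(),
    Activation("relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(2, activation="softmax"),   # COVID-19 vs. non-COVID-19
])
```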

Transfer learning

The number of parameters in a model increases as networks get deeper to improve learning capacity, and deeper networks require more complex computation and more training data. We have only 387 cases with 746 images for training, validation and testing, which is too small to train a deep CNN from scratch, so we need the transfer learning concept. Transfer learning takes advantage of a model (here, a CNN) pre-trained on a huge database to help with learning a target task (e.g., diagnosing COVID-19 from CT scans) that has limited training data [26]. In other words, we transfer the information (expressive and generalizable feature representations) learned elsewhere to our problem with an insufficient database. One problem arises when applying this strategy: the large image collections mostly belong to the general domain (e.g., cats, dogs and chairs), whereas our images are COVID-19 CTs with a different visual appearance. As a result, the visual representations learned on these large datasets may not represent CT images well, and the features extracted by the network can be biased toward the source data and generalize poorly to the target data. Therefore, the pre-trained models are fine-tuned on our new dataset with its smaller number of training images; that is, the pre-trained CNN structures are modified to suit our task. This procedure is usually much faster than conventional training of a CNN model from random weights. Training the enormous number of parameters of a CNN adequately requires a great deal of data, so we use transfer learning to compensate for the limited dataset and achieve better results. Several well-known CNN architectures have been trained on very large image collections with many categories and are available as pre-trained CNN models. They are trained on ImageNet, which contains 14 million images of 1000 different categories, from animals (dogs, cats, lions, …) to objects (desks, pens, chairs, …) [27]. EfficientNets (B0–B5) [28], NasNetLarge, NasNetMobile [29], InceptionV3 [30], ResNet-50 [31], SeResnet50 [32], Xception [33], DenseNet121 [34], ResNext50 [35] and Inception_resnet_v2 [36] are popular pre-trained CNNs. These networks offer benefits to researchers such as shorter training time, weaker and cheaper hardware requirements and a lower computational load.
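As a minimal sketch of this idea (assuming a Keras-style setup; InceptionV3 is used only as an example backbone, not necessarily the authors' choice), one can load an ImageNet-pre-trained convolutional base, discard its original classifier and attach a new two-class head to be fine-tuned on the CT images:

```python
from keras.applications import InceptionV3
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model

# Convolutional base with ImageNet weights; the original 1000-class top is dropped.
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))

# New head for the binary COVID-19 / non-COVID-19 task.
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(2, activation="softmax")(x)
model = Model(inputs=base.input, outputs=outputs)  # all layers stay trainable for fine-tuning
```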

Training details

We evaluated the pre-trained CNNs by fine-tuning each of them on our clinical dataset separately. For this purpose, we adopted 15 pre-trained CNNs: EfficientNets (B0–B5), NasNetLarge, NasNetMobile, InceptionV3, ResNet-50, SeResnet50, Xception, DenseNet121, ResNext50 and Inception_resnet_v2. Image augmentation was used in our study: in addition to the initial dataset, an augmented training set was generated, since data augmentation applied during training reduces the overfitting of deep CNNs. In this research, we applied random horizontal and vertical shifts of up to 10% of the original dimensions, random rotation (20°) combined with a small random zoom, and horizontal flipping to increase the effective size of the training set. To fine-tune the networks, we kept only the convolutional part of each model's architecture, removing all fully connected layers. On top of the last convolutional layer, we added a global average pooling layer followed by the final classification layer with softmax non-linearity. All models were fine-tuned for 50 epochs using stochastic gradient descent (SGD) optimization with an initial learning rate of 0.0001, Nesterov momentum of 0.9 and a batch size of 32. In all cases, categorical cross-entropy was used as the loss function, and the hyperparameters were tuned on the validation set. It should be noted that each network expects inputs of a different size, so in the first step of data preparation all images were resized to the proper sizes and stored in separate folders. Table 1 compares the pre-trained CNN models. All models were trained using the same initialization and learning rate policies.
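The following sketch illustrates this training setup in Keras for one backbone (the directory paths and the zoom factor are illustrative assumptions, not the authors' published code):

```python
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model
from keras.optimizers import SGD
from keras.preprocessing.image import ImageDataGenerator

# Backbone with a new softmax head, assembled as in the previous sketch.
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
outputs = Dense(2, activation="softmax")(GlobalAveragePooling2D()(base.output))
model = Model(inputs=base.input, outputs=outputs)

# Augmentation roughly matching the settings reported above (zoom factor assumed).
train_aug = ImageDataGenerator(width_shift_range=0.1, height_shift_range=0.1,
                               rotation_range=20, zoom_range=0.1,
                               horizontal_flip=True,
                               preprocessing_function=preprocess_input)
val_aug = ImageDataGenerator(preprocessing_function=preprocess_input)  # no augmentation

model.compile(optimizer=SGD(lr=1e-4, momentum=0.9, nesterov=True),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Hypothetical folder layout: data/train/<class>/ and data/val/<class>/.
model.fit_generator(
    train_aug.flow_from_directory("data/train", target_size=(299, 299),
                                  batch_size=32, class_mode="categorical"),
    validation_data=val_aug.flow_from_directory("data/val", target_size=(299, 299),
                                                batch_size=32, class_mode="categorical"),
    epochs=50)
```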

Table 1 Comparison of the analyzed pre-trained CNNs

Evaluation metrics

The 15 pre-trained CNN models were fine-tuned independently, and predictions were made on the test set. For this purpose, the dataset was split into a training set, a validation set and a test set (60%–15%–25%). Common classification metrics, namely accuracy, precision, recall, F1 score and AUC (the area under the receiver operating characteristic (ROC) curve), were used to evaluate the proposed method. The F1 score is the harmonic mean of precision and recall. The AUC is a threshold-independent performance measure for classification problems that represents the degree of separability, i.e., how well the model distinguishes between the classes.
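A minimal sketch of how these metrics can be computed with scikit-learn is shown below (an assumed helper, not the authors' code; `y_true` holds the 0/1 test labels and `y_prob` the predicted probability of the COVID-19-positive class):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, recall, F1 and ROC-AUC for binary predictions."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)  # hard labels at the threshold
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_prob),  # threshold-free
    }
```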

Results

The 15 pre-trained CNN models, namely EfficientNets (B0–B5), NasNetLarge, NasNetMobile, InceptionV3, ResNet-50, SeResnet50, Xception, DenseNet121, ResNext50 and Inception_resnet_v2, all pre-trained on the ImageNet dataset, were fine-tuned independently on our dataset of COVID19-CT images, with the goal of transferring their knowledge to our task with its limited training data. DL was performed in Python version 3.5 (Python Software Foundation, Beaverton, Oregon) with Keras version 2.1.5 (GitHub, San Francisco, California) on a graphics processing unit (GeForce GTX 1080 Ti, NVIDIA, Santa Clara, California). Figures 2 and 3 show the accuracy and the loss function on the training set during fine-tuning of the different pre-trained CNN models, respectively. As shown in these figures, the models converge by the 50th epoch of training, with the curves varying within a narrow range. Therefore, after 50 epochs of training, diagnostic accuracy was calculated on the test set. The results of transfer learning with the different structures are reported in Table 2 using the common classification metrics accuracy, precision, recall, F1 score and AUC. EfficientNetB0, EfficientNetB5 and InceptionV3 give the highest accuracy (0.82) in the classification of CT images compared with the other architectures, but the precision (0.847) and recall (0.822) of the EfficientNetB0 model are the best among the pre-trained CNN models. The results of the ROC-AUC analysis used to assess the diagnostic ability of the different pre-trained CNN models on the CT images are shown in Fig. 4. The DL architectures with the largest AUCs were EfficientNetB0 (AUC: 0.907), InceptionV3 (AUC: 0.897) and EfficientNetB5 (AUC: 0.886).

Fig. 2 Training accuracy for different pre-trained CNN models

Fig. 3 Cross-entropy loss function for different pre-trained CNN models

Table 2 Classification metrics on the test dataset for the different architectures of deep transfer learning models and the proposed ensemble method. For each model, the average (± std.) performance is reported over the best 5 trained model checkpoints
Fig. 4 ROC/AUC curves of the different architectures of deep transfer learning models

To improve the classification accuracy, an ensemble method was developed in which the outputs of the different deep transfer learning architectures act as individual classifiers and are fused in an additional step. A majority voting criterion was used by the ensemble of classifiers to assign the final prediction to the test data: for a given test image, the outputs of the classifiers are averaged to generate the final output of the ensemble. We experimented with ensembles of 3, 5, 7, 9, 11, 13 and 15 different deep transfer learning architectures. The experimental results indicate that majority voting over 5 deep transfer learning architectures, namely EfficientNetB0, EfficientNetB3, EfficientNetB5, Inception_resnet_v2 and Xception, achieves the best results in diagnosing COVID-19 from CT scans. The performance of the ensemble model is given in Table 2. The proposed ensemble approach with a majority voting scheme for the final prediction outperforms the individual models, with higher accuracy (0.85), precision (0.857) and recall (0.852). The confusion matrix for the best ensemble model is presented in Table 3. In this table, 22 images labeled as COVID-19-negative are classified as COVID-19-positive by our proposed method. A possible reason is that the negative class in the test set also contains other types of lung disease.
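A minimal sketch of such a voting ensemble is given below (assumed code, not the authors' implementation; `models` is a hypothetical list of fine-tuned Keras models and `x_test` the correspondingly preprocessed test images). It shows hard majority voting over the predicted class labels, with averaging of the softmax outputs (soft voting) as the closely related alternative described above:

```python
import numpy as np

def majority_vote(models, x_test):
    """Hard majority voting: each fine-tuned model casts one vote per test image."""
    # votes has shape (n_models, n_samples); each entry is a predicted class index.
    votes = np.stack([np.argmax(m.predict(x_test), axis=1) for m in models])
    # For every sample, count the votes per class and return the most frequent class.
    return np.apply_along_axis(lambda v: np.bincount(v, minlength=2).argmax(),
                               axis=0, arr=votes)

def soft_vote(models, x_test):
    """Soft voting: average the softmax outputs, then take the most probable class."""
    mean_prob = np.mean([m.predict(x_test) for m in models], axis=0)
    return np.argmax(mean_prob, axis=1)
```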

Table 3 Confusion matrix for the best ensemble model

Discussion

In this research, we investigated the performance of an ensemble deep learning structure for the automated detection of COVID-19 on a publicly available dataset of CT images. We explored 15 state-of-the-art pre-trained CNN architectures that have been trained on, and shown excellent performance for, the ImageNet dataset, namely EfficientNets (B0–B5), NasNetLarge, NasNetMobile, InceptionV3, ResNet-50, SeResnet50, Xception, DenseNet121, ResNext50 and Inception_resnet_v2. These pre-trained networks were then fine-tuned on the target task with its limited training data. The EfficientNetB0 model performed best among the pre-trained CNN models in terms of accuracy (0.82), precision (0.847) and recall (0.822) in the classification of CT images. Finally, to improve the classification accuracy, an ensemble method based on majority voting over the outputs of different deep transfer learning architectures was developed. We observe that majority voting over the predictions of 5 deep transfer learning architectures, namely EfficientNetB0, EfficientNetB3, EfficientNetB5, Inception_resnet_v2 and Xception, is the best model, with higher accuracy (0.85), precision (0.857) and recall (0.852) than the individual transfer learning models.

An ensemble of classifiers typically outperforms a single classifier because it reduces the variance of the final prediction. In the context of deep transfer learning, an ensemble of classifiers can be built by varying the architecture of the pre-trained CNN used for transfer learning, since certain neural architectures lend themselves to transfer learning and representation learning better than others. We evaluated the efficacy of deep transfer learning with different network architectures through extensive experiments: ensembles of 3, 5, 7, 9, 11, 13 and 15 different deep transfer learning architectures were formed, and the final classification was obtained by majority voting.

Compared with other similar studies, this work has the advantage of comparing deep transfer learning models with different pre-trained CNNs and of proposing an ensemble deep transfer learning system. The main drawback of the research is the limited size of the dataset available to train the networks. By applying regularization and simplifying the deep models, we were able to mitigate this problem. Our aim in the future is to further expand the experimental space by collecting more samples and applying the developed methodology to other CT images.

Conclusion

We performed a comprehensive study to systematically investigate the effects of different transfer learning structures with different pre-trained CNN architectures and propose a new ensemble approach that synergistically integrates transfer learning strategies for COVID-19 diagnosis, providing insightful findings on a publicly available dataset of CT images. We observed that majority voting over 5 deep transfer learning architectures, namely EfficientNetB0, EfficientNetB3, EfficientNetB5, Inception_resnet_v2 and Xception, gives higher results than the individual transfer learning structures in terms of precision (0.857), recall (0.854) and accuracy (0.85). Therefore, our proposed method can work well for the diagnosis of COVID-19 based on CT scans.