1 Introduction

Every year, 12% of women are diagnosed with breast cancer (McGuire et al. 2015). In the US alone, 40,000 women die of breast cancer annually (Bharati et al. 2018; Cancer.gov 2018). Evidence shows that early detection of breast cancer can significantly increase the survival rate of women.

Mammography, a specialized X-ray of the breast, is one of the most common diagnostic tools for detecting breast cancer (Bharati et al. 2020a, b; Thanh and Surya 2019). 3D mammography is a more advanced technique than conventional mammography: it uses multiple breast X-rays to create a 3D picture of the breast. A 3D mammogram is used to find breast cancer in patients who have no signs or symptoms. It can also be used to investigate other breast issues, such as a breast mass, pain, and nipple discharge (Kumar et al. 2020).

When screening for breast cancer, 3D mammogram machines create both 3D images and standard 2D mammogram images (Clinic 2020). Studies showed that “Combining 3D mammograms with standard mammograms reduces the need for additional imaging and slightly increases the number of cancers detected during screening”.

Both mammography and 3D mammography can reveal masses and even calcifications, which are precursors to breast cancer. However, correctly interpreting these images can be challenging for radiologists. Moreover, time constraints in assessing the images often result in incorrect diagnoses with detrimental consequences. For instance, a false negative diagnosis, in which a case is judged normal when it is in fact an early form of breast cancer, can significantly decrease the chance of 5-year survival.

A mammogram is a type of X-ray image of the breast. It can be captured by mammography, 3D mammography, or both. Doctors use the mammogram to identify the signs of breast cancer, as it can reveal calcifications, lumps, dimpling, etc., which are common signs in the early stage of breast cancer. The mammograms of the Digital Database for Screening Mammography (DDSM) dataset are used in this paper. This dataset is available in an online repository (DDSM 2020), which is described by Lee et al. (2017). DDSM is a collection of labeled mammographic images maintained by the research community (Heath et al. 1998). It is one of the largest datasets for studying breast cancer. Figure 1 shows some types of images in DDSM.

Fig. 1 Some types of images of DDSM

The dataset contains 2620 instances. The instances are the mammograms of patients with masses, calcifications, etc. The labels for calcifications and masses fall into four categories.

  (1) Benign.

  (2) Malignant.

  (3) Benign without a callback.

  (4) Unproven.

In addition, the images have been categorized on a scale of 1–5 according to BI-RADS (Breast Imaging Reporting and Data System), a widely used standard for reporting mammogram findings. A score of 5 indicates that the mammogram is highly suspicious, with a probability of breast cancer of almost 95%. To simplify our analysis, patches are used instead of full images. This makes computation more efficient and also improves performance, because feature detection becomes easier. Our dataset contains 10,713 patches. Table 1 summarizes the statistics of the DDSM patch data set.

Table 1 Summary of the DDSM patch data set

All images are provided as DICOM files at two levels: the full mammogram and the individual abnormality. The full mammograms include both the CC and MLO views of the breast. Abnormalities are represented as binary mask images of the same size as their associated mammograms, and each mask delineates the ROI of one abnormality. Users can therefore perform an element-wise selection of the pixels inside the abnormality mask created for each mammogram. For our analysis, we extracted the images cropped to the abnormalities only. We also split the dataset into train, validation, and test sets using Python.
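The element-wise selection of pixels inside an abnormality mask can be sketched as follows. This is a minimal pure-Python illustration, not the preprocessing code used for this paper: the function name `crop_to_mask` and the toy arrays are hypothetical, and real DICOM images would first have to be loaded into arrays of the same shape as their masks.

```python
def crop_to_mask(image, mask):
    """Crop `image` to the bounding box of the non-zero pixels in `mask`.

    Both arguments are equally sized 2D lists (rows of pixel values).
    Pixels inside the box but outside the mask are set to 0.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return []  # empty mask: nothing to crop
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [
        [image[r][c] if mask[r][c] else 0 for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]


# Toy 3 x 4 "mammogram" and its binary abnormality mask (illustrative values).
image = [
    [10, 11, 12, 13],
    [14, 15, 16, 17],
    [18, 19, 20, 21],
]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
print(crop_to_mask(image, mask))  # [[15, 16], [0, 20]]
```

Cropping to the bounding box of the mask is one simple way to obtain an abnormality patch; other patch sizes or padding schemes are equally possible.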

Exploring the data yields the plots shown in Figs. 2 and 3. Overall, there are more cases of masses than of calcifications (Fig. 2). The numbers of malignant and benign cases appear to be roughly equal for both calcifications and masses. For both calcifications and masses, a few cases have been categorized as ‘unproven’. In our analysis, we mark these as pathological, as it is not clear whether they can be considered healthy patients or not (their number is small and should not have a strong negative impact on our predictive power). As we aim to build a binary classifier, we label all mass and calcification patches as pathological. In total, we have 4506 pathological and 6027 non-pathological patches (Fig. 3).

Fig. 2 Classes and labels of the DDSM dataset

Fig. 3 Split of the patches based on pathology

In this paper, we propose MVGG, a model based on VGG16. VGG16 is modified for our application by replacing the final feed-forward dense layers with a single dense layer of 32 nodes, followed immediately by an output layer with one node and sigmoid activation (for binary classification). Binary classification is used to predict breast cancer: mass and calcification labels are categorized as ‘pathological’, and normal images are categorized as ‘non-pathological’. VGG16 was originally designed to distinguish up to 1000 classes and therefore has wide dense layers (4096 nodes). The width of these layers is reduced so that the feature information is not mixed when passing from 4096 nodes to just one node in the output layer.
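To give a sense of how much narrower the modified head is, the following sketch counts the dense-layer parameters before and after the change. It assumes 224 × 224 inputs (the standard VGG16 input size, not stated in this paper), so that the flattened convolutional output has 7 × 7 × 512 values; the helper `dense_params` is introduced here for illustration only.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Assumption: 224 x 224 inputs, so the last VGG16 pooling layer
# yields a 7 x 7 x 512 feature map (25,088 values when flattened).
FLATTENED = 7 * 7 * 512

# Original VGG16 head: 4096 -> 4096 -> 1000 classes.
original_head = (
    dense_params(FLATTENED, 4096)
    + dense_params(4096, 4096)
    + dense_params(4096, 1000)
)

# Modified head as described in the text: 32 nodes -> 1 sigmoid node.
modified_head = dense_params(FLATTENED, 32) + dense_params(32, 1)

print(original_head)  # 123642856
print(modified_head)  # 802881
```

Under these assumptions the dense head shrinks by roughly two orders of magnitude, while the convolutional feature extractor is unchanged.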

The significant contributions of this research work are listed as follows:

  • A modified VGG model is proposed to diagnose breast cancer using 2D and 3D mammogram images.

  • The proposed hybrid transfer learning model (a fusion of MVGG and ImageNet) provides an accuracy of 88.3%, which surpasses existing machine learning models.

  • The data augmentation and regularization approaches enhance the breast cancer detection rate and improve the performance of the proposed system.

2 Literature reviews

Several machine learning (ML) and deep learning approaches have been applied in health care systems. ML is used in various domains, including health care, disease detection, and biomedicine (Tiwari and Melucci 2019a). Some works (Tiwari and Melucci 2018, 2019b; Khamparia et al. 2020) on binary and multi-class classification using machine learning report performance metrics such as accuracy, recall, precision, and F1 score (Tiwari and Melucci 2018). Unsupervised algorithms have already been applied to breast cancer, lung cancer, and coronavirus (Tiwari et al. 2020; Mondal et al. 2020). Deep learning can therefore be used to detect diseases from image data. In addition, an image fusion algorithm has been applied to medical images, making efficient use of big data (Tan et al. 2020). Traditional deep learning techniques have been used to detect blood cells on a dataset of 13 k images, with results reported in terms of the performance metrics (Tiwari et al. 2018). Next, a hybrid method was proposed by Reddy et al. (2020), who used hybrid deep belief networks (DBNs) and MRI images to detect glioblastoma tumors. Their method combined DBN with DTW to improve the efficiency of DBN. Similarly, we use a hybrid MVGG16 ImageNet model to enhance efficiency.

A deep learning-based system for the classification of breast tissue images is proposed in Rakhlin et al. (2018). From those images, crops of 650 × 650 and 400 × 400 pixels are extracted. Next, pre-trained VGG-16, InceptionV3, and ResNet-50 networks are used for feature extraction. A LightGBM classifier with 10-fold cross-validation is then applied to the extracted deep features for classification. That technique achieves an average accuracy of 87.2% across leave-one-out for breast cancer image classification (Rakhlin et al. 2018). In other work (Kwok 2018), four DCNN architectures, i.e. InceptionResnetV2, InceptionV3, VGG19, and InceptionV4, are used for the classification of breast cancer images. The size of the images is 1495 × 1495 of 99 pixels. Various data augmentation schemes are also applied to increase the accuracy. In Vang et al. (2018), an ensemble-based architecture is proposed for multi-class image classification of breast cancer. Their ensemble classifier combines logistic regression, a gradient boosting machine (GBM), and majority voting to produce the final prediction.

Moreover, an ensemble-based boosted neural network has been used for the diagnosis of lung cancer (Alzubi et al. 2019). The bagging algorithm is improved in Alzubi (2015). This algorithm cannot provide a good result for this complex dataset. Therefore, our proposed work is intended to be applicable to other complex image data. It can be used in an IoT healthcare system in which data are transmitted securely, as described by Rani et al. (2019). Furthermore, Qian et al. (2020) applied unsupervised dictionary learning in an internet-based healthcare system for patient monitoring. They offered an ECG compression method where they measured EEG. The method updates the dictionary continuously as the hidden patterns are refined. In Vahadane et al. (2015), a stain-normalization technique is applied to stained images, achieving an accuracy of 87.50%. Another study (Sarmiento and Fondón 2018) uses a machine learning method in which feature vectors are extracted from various characteristics, i.e. texture, color, shape, etc. With tenfold cross-validation, the SVM provides 79.2% accuracy. Lastly, Nawaz et al. (2018) use a fine-tuned AlexNet for the automatic classification of breast cancer and achieve an accuracy of 75.73% on a patch-wise dataset.

Our literature review covered three topics: (1) the state-of-the-art deep learning architectures for the task of binary classification of images, (2) the performance achieved in similar tasks as a benchmark for our algorithm, and (3) studies on physicians’ performance to understand the clinical implications of such algorithms.

2.1 State of the art architectures

The VGG network was presented by Simonyan and Zisserman (2014). It is a simple model consisting of a 13-layer CNN with 3 × 3 filters (Fig. 4) and 2 × 2 max-pooling layers. Multiple smaller kernels perform comparatively better than a single larger kernel, because the increased depth of the VGG network allows it to learn more complex features.

Fig. 4 The characteristic of 3 × 3 convolution layers of the VGG
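The trade-off between stacked small kernels and a single large kernel can be made concrete with a short calculation. The sketch below assumes stride-1 convolutions with the same number of input and output channels and no biases; the channel count of 64 is an illustrative choice.

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels):
    """Weights of one conv layer with `channels` inputs and outputs (no bias)."""
    return kernel * kernel * channels * channels


C = 64  # illustrative channel count
stacked = 2 * conv_weights(3, C)  # two stacked 3 x 3 layers
single = conv_weights(5, C)       # one 5 x 5 layer

print(receptive_field(2))  # 5
print(stacked, single)     # 73728 102400
```

Two stacked 3 × 3 layers thus cover the same 5 × 5 region as one 5 × 5 layer with fewer weights, and they add an extra non-linearity between the layers.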

Secondly, the Residual Network (ResNet) is considered. ResNet, presented by He et al. (2016), is one of the most popular architectures for image classification. Its distinguishing feature is the residual block (Fig. 5), which allows the network to reach a depth of 152 layers. Vanishing gradients are a common problem in deep convolutional networks: as depth increases, they can degrade performance. The residual block mitigates this problem.

Fig. 5 The residual model of the ResNet architecture
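Why the skip connection helps against vanishing gradients can be illustrated with a toy calculation. The sketch assumes every layer is a scalar function with the same small derivative, which is a strong simplification of a real network: a plain stack multiplies these derivatives, whereas a residual block y = x + F(x) has derivative 1 + F'(x).

```python
def plain_gradient(layer_derivative, depth):
    """Gradient through `depth` plain layers: product of the layer derivatives."""
    return layer_derivative ** depth


def residual_gradient(layer_derivative, depth):
    """Gradient through `depth` residual blocks y = x + F(x)."""
    return (1.0 + layer_derivative) ** depth


d, depth = 0.01, 20  # illustrative values
print(plain_gradient(d, depth))     # vanishingly small
print(residual_gradient(d, depth))  # stays of order one
```

The identity term keeps the gradient from collapsing towards zero even when each learned transformation contributes only a small derivative.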

Finally, MobileNets, proposed for mobile and embedded vision applications, are based on a streamlined architecture that uses depth-wise separable convolutions to build lightweight deep neural networks. This network was introduced by Howard et al. (2017). The core layer of MobileNet is the depth-wise separable convolution. In addition, the width and resolution can be tuned to trade off latency against accuracy. The advantage of MobileNet over other architectures is that it requires very little computation power to run or to apply transfer learning to. This makes it a good fit for mobile devices, embedded systems, and computers without a GPU or with low computational capacity, without significantly compromising the accuracy of the results (Fig. 6).

Fig. 6 MobileNet architecture
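The saving from depth-wise separable convolutions follows from the cost model given by Howard et al. (2017). The sketch below counts multiply-adds for a k × k kernel, m input channels, n output channels, and an f × f feature map; the concrete layer sizes are illustrative and not taken from this paper.

```python
def standard_conv_cost(k, m, n, f):
    """Multiply-adds of a standard k x k convolution."""
    return k * k * m * n * f * f


def separable_conv_cost(k, m, n, f):
    """Multiply-adds of a depth-wise (k x k) plus point-wise (1 x 1) convolution."""
    depthwise = k * k * m * f * f
    pointwise = m * n * f * f
    return depthwise + pointwise


k, m, n, f = 3, 512, 512, 14  # illustrative layer sizes
ratio = separable_conv_cost(k, m, n, f) / standard_conv_cost(k, m, n, f)
print(round(ratio, 4))  # equals 1/n + 1/k**2
```

For 3 × 3 kernels the separable layer needs roughly one eighth to one ninth of the computation of the standard layer.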

2.2 Prior art in breast cancer classification

Several algorithms and classifiers have been developed to detect breast cancer using different datasets. For instance, a paper published in 2015 obtained 85% accuracy in identifying images with a mass and also localized 85% of masses in mammograms with an average false positive rate per image of 0.9 (Ertosun and Rubin 2015). Shen (2017) developed an end-to-end training algorithm for whole-image diagnosis. It deploys a simple convolutional design that achieves a per-image AUC score of 0.88 on the DDSM dataset. We adopt this metric as the benchmark for our algorithm, in addition to an accuracy benchmark of 85%.

2.3 Physician performance

Several high-quality studies have explored the performance of physicians diagnosing mammograms. One study looked at radiologist performance on mammograms from 1192 patients (Rafferty et al. 2013). In a first study, 312 cases (48 cancer cases) were diagnosed by 12 radiologists who recorded whether an abnormality requiring a callback was present. This resulted in a sensitivity of 65.5% and a specificity of 84.1%. In a second study, 312 cases (51 cancer cases) were analyzed by 15 radiologists, who had received additional training and also reported the type and location of the lesion. This resulted in a sensitivity of 62.7% and a specificity of 86.2%. Another high-quality study (Kolb et al. 2002) compared different diagnosis methods, namely mammography, ultrasonography (US), and physical examination (PE), on a data set of 27,825 screening sessions and compared the results of the three methods with the actual biopsy. The results showed a sensitivity of 77.6% and a specificity of 98.8%. However, these scores were not achieved by radiologists using only mammograms and are thus not a suitable benchmark for this task.

Most relevant as a benchmark for our analysis is the first study by Rafferty et al. (2013), as the 12 radiologists restricted themselves to binary classification, which is similar to our approach.

2.4 Clinical significance

Understanding the medical implications of mammogram diagnosis is essential for developing an algorithm with clinical significance. Radiologists have a considerably higher specificity than sensitivity, which means that their false-negative rate is higher than their false-positive rate. Table 2 presents a comparative analysis of the risks of the two types (1 and 2) of diagnostic errors.

Table 2 Comparison of risks for type 1 and 2 diagnostic errors
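As a worked illustration of why a higher specificity than sensitivity implies more false negatives than false positives, take the figures of the first study by Rafferty et al. (2013) quoted in Sect. 2.3, with FNR and FPR denoting the false-negative and false-positive rates:

$$ \mathrm{FNR} = 1 - \mathrm{sensitivity} = 1 - 0.655 = 0.345, \qquad \mathrm{FPR} = 1 - \mathrm{specificity} = 1 - 0.841 = 0.159 . $$

In that study, roughly one in three cancer cases was missed, whereas about one in six normal cases was flagged.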

A false-positive diagnosis means that the radiologist judges a normal mammogram to be of malignant or benign type. As a result, the patient has to revisit the clinic, and in most cases, further testing through a biopsy is performed. A biopsy for breast cancer detection is minimally invasive, and only a small incision is needed. However, there is a range of evidence showing the psychological effects of such false positives. According to a study from 2000, they can lead to short-term distress as well as long-term anxiety (Aro et al. 2000). In contrast, a false negative implies that a potentially cancerous case is misinterpreted as healthy. The consequences of this can be very severe, because untreated breast cancer progresses through its stages, and each stage is associated with a different 5-year life expectancy (Table 3; Cancer.gov 2018).

Table 3 The survival rate of breast cancer stages

In general, it is more important to avoid a false negative than a false positive. A mammogram is first followed by sonography, and if this is positive as well, further testing is done via a minimally invasive biopsy. However, in the future, it would be valuable to differentiate further and decrease the false positives in BI-RADS category 3. As of now, 98% of patients in this category have to come back every 6, 12, and 24 months for a check-up, yet do not have breast cancer. This is a large burden.

Considering Table 2 and the two interviews, we conclude that a false negative error can have more severe consequences than a false positive error. Thus, we decide to design our algorithm to have a threshold that is more sensitive than specific.

3 Modeling

3.1 Data cleaning

Before building the model, we clean the data set to convert it into the appropriate form. We assign new, binary labels to the images by categorizing all the original mass and calcification labels as ‘pathological’, and the normal images as ‘non-pathological’. Thus, the problem is reduced to binary classification. Next, we randomly divide the data set into train, validation, and test splits, in approximate proportions of 75:10:15, respectively. While doing so, we ensure that the splits are evenly balanced between the two classes (as evident in Fig. 7). This train-test split method is adopted to validate the work, with 75% of all images in the dataset used for training.

Fig. 7 Balance of classes in the train, validation, and test splits
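The relabeling and the class-balanced 75:10:15 split can be sketched as follows. This is an illustrative pure-Python version, not the code used for this paper; the label string `"normal"`, the helper names, and the toy data are assumptions.

```python
import random


def to_binary(label):
    """Map an original label to 1 (pathological) or 0 (non-pathological).

    Assumption: normal images carry the hypothetical label "normal".
    """
    return 0 if label == "normal" else 1


def stratified_split(labels, fractions=(0.75, 0.10, 0.15), seed=0):
    """Return train/validation/test index lists, preserving class proportions."""
    rng = random.Random(seed)
    splits = ([], [], [])
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        splits[0].extend(idx[:n_train])
        splits[1].extend(idx[n_train:n_train + n_val])
        splits[2].extend(idx[n_train + n_val:])  # remainder goes to test
    return splits


# Hypothetical toy labels: 40 pathological and 60 non-pathological patches.
raw = ["mass"] * 25 + ["calcification"] * 15 + ["normal"] * 60
labels = [to_binary(x) for x in raw]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 75 10 15
```

Splitting each class separately keeps the ratio of pathological to non-pathological patches the same in all three subsets.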

3.2 Performance evaluation

Classification performance is evaluated with two metrics: accuracy and AUC, the area under the ROC curve. Both are widely used for evaluating classifiers. Accuracy is the percentage of cases that the model classifies correctly. AUC measures the capability of the model to discriminate between the two classes.
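Both metrics can be computed directly from labels and predicted scores. The sketch below is a minimal pure-Python version on toy data: accuracy uses a 0.5 decision threshold (an assumption), and AUC is computed through its rank interpretation, i.e. the probability that a randomly chosen positive case scores higher than a randomly chosen negative one.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of cases classified correctly at the given threshold."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def auc(y_true, y_score):
    """Probability that a positive outscores a negative; ties count half."""
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (1 = pathological) and model scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(accuracy(y_true, y_score))  # 0.75
print(auc(y_true, y_score))       # 0.75
```

Unlike accuracy, AUC does not depend on a particular threshold, which is why the two metrics are reported together.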

3.3 Modeling process

The modeling process is summarized in the flow chart in Fig. 8.

Fig. 8 Workflow diagram of the modeling process

3.4 Model building

Model building is the first step of the modeling process. It can be divided into four sub-stages:

  1. Construction of the baseline model and performance evaluation.

  2. Training of popular models with various architectures, and selection of the best models.

  3. Deployment of regularization and data augmentation methods to improve the performance, and selection of the best-performing model as the final model.

  4. Tuning of the hyper-parameters of the final model to reach the desired performance level.

Table 4 describes the architecture of the baseline (simple) model. It has two 2-dimensional convolution layers with 32 and 64 filters, respectively, topped by a dense layer with 32 nodes. When the model is evaluated on the test set, an accuracy of 75.9% is obtained. It is about 13% lower than the desired accuracy benchmark.

Table 4 Architecture of the baseline model

In the next stage, several image classification models are implemented: ALEXNET, VGG16, VGG19, MVGG, MobileNet, and ResNet50. The models are then modified by replacing the final feed-forward dense layers with a single layer of 32 nodes. These models were originally designed to distinguish up to 1000 classes and therefore have wide dense layers (4096 nodes).

Two additional methods are considered after choosing the final model to see whether they improve performance: (1) data augmentation and (2) pre-training.

In the data augmentation stage, three operations are performed on the input images:

  (1) Flip the images along a horizontal axis.

  (2) Shift vertically/horizontally within a width range of 0.2.

  (3) Rotate randomly within a twenty-degree range.
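The first two augmentation operations can be sketched in pure Python on images stored as 2D lists. This is an illustrative version only: the flip is assumed to mirror the image left to right, the shift range of 0.2 is interpreted as a fraction of the image size, and rotation is omitted because it needs interpolation that an image library would normally provide. All function names are hypothetical.

```python
import random


def flip_horizontal(image):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in image]


def shift(image, dx, dy, fill=0):
    """Shift by dx columns (right) and dy rows (down); vacated pixels get `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nr, nc = r + dy, c + dx
            if 0 <= nr < h and 0 <= nc < w:
                out[nr][nc] = image[r][c]
    return out


def augment(image, rng, max_shift=0.2):
    """Randomly flip, then shift by up to `max_shift` of the image size."""
    h, w = len(image), len(image[0])
    if rng.random() < 0.5:
        image = flip_horizontal(image)
    dx = rng.randint(-int(max_shift * w), int(max_shift * w))
    dy = rng.randint(-int(max_shift * h), int(max_shift * h))
    return shift(image, dx, dy)


tiny = [[1, 2, 3], [4, 5, 6]]
print(flip_horizontal(tiny))  # [[3, 2, 1], [6, 5, 4]]
print(shift(tiny, 1, 0))      # [[0, 1, 2], [0, 4, 5]]
```

Each call to `augment` with a seeded `random.Random` yields a slightly different view of the same patch, which is how augmentation enlarges the effective training set.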

Pre-training means initializing the model parameters with values learned from a different data set instead of random values. Pre-training can not only speed up learning but also lead to better local optima in gradient-based optimization. In this paper, the best model is initialized with weights pre-trained on the ImageNet data set.

To tune the best model, the batch size and learning rate are varied to improve the accuracy.

4 Results and interpretation

4.1 Performance of different models

The performance of different implemented models is shown in Table 5.

Table 5 Performance evaluation of different models

Table 5 compares various classification algorithms, including ALEXNET, ResNet50, MobileNet, VGG16, and VGG19; our final model, hybrid MVGG16 ImageNet, provides higher accuracy, precision, recall, and F1 score than the other traditional algorithms on our DDSM dataset. We report our results for 15 epochs, because the validation loss saturates after 15 epochs; the choice depends on how the validation loss behaves after each epoch. VGG16 is modified in our application by replacing the final feed-forward dense layers with a single layer of 32 nodes, followed immediately by an output layer with one node and sigmoid activation (for binary classification). At the end of the model building process, we find that the pre-trained modified VGG16 (MVGG16) model outperforms all others in terms of accuracy. The architecture of the model is shown in Fig. 9. It produces an accuracy of 86.9% on the test set and an AUC of 0.933. This is better than our benchmark on both metrics. In Fig. 10, we can see that the model starts to overfit strongly after 6 epochs.

Fig. 9 Details of the architecture of MVGG

Fig. 10 Training and validation of the best model

After testing the three architectures, we see that the modified VGG16 outperforms both ResNet50 and MobileNet. MobileNet produces lower accuracy. The gap to ResNet50 might be due to the features of the images and to ResNet50’s loss function settling in a higher local minimum.

It is also observed that pre-training improves performance more than data augmentation, possibly because the initial weights enable the model to find a better local minimum of the loss function during gradient descent. It could also be useful to run the model with data augmentation for more epochs and with extra resources, as convergence is slow owing to the large size of the data.

4.2 ROC analysis

Figure 11 shows that the final proposed model has an AUC value of 93.3%. This is better than our benchmark AUC value of 88% from Shen (2017). Additionally, our model outperforms radiologists in classifying mammograms as pathological or not: it surpasses the benchmark of the first study, in which 12 radiologists assessed 312 cases with a sensitivity of 65.5% and a specificity of 84.1% (Rafferty et al. 2013). For study 2, the physicians used not only mammography but also other diagnostics, which could be a reason for better results.

Fig. 11 ROC curve of the final model

After finalizing the algorithm, we estimate its mathematically optimal operating point, i.e. the best threshold for the algorithm to declare a mammogram as either pathological or non-pathological. We compute Youden’s J statistic as follows:

$$ J = \max_{c} \left\{ \mathrm{sensitivity}\left( c \right) + \mathrm{specificity}\left( c \right) - 1 \right\} . $$
(1)

Youden’s index is the probability of an informed decision (as opposed to a random guess) and considers all predictions. We have used it to set optimal thresholds for medical breast cancer tests. In other words, this threshold minimizes the combined rate of false positives and false negatives, treating both as equally important. However, as discussed in Sect. 2.4, from a clinical perspective, reducing false negatives is more important than reducing false positives. Thus, we decided to weigh reducing false negatives as twice as important as reducing false positives and calculated the optimal threshold by maximizing the cost function 0.66 * true positive rate + 0.33 * (1 − false positive rate). The resulting clinically optimal threshold is 0.17, which increases the true positive rate (thereby decreasing the false negative rate) by 10% while increasing the false positive rate by 15%.
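Both threshold choices can be sketched with a small search over candidate cut-offs. The code below is an illustration on toy data, not the procedure used for this paper: it assumes the weighted cost function combines sensitivity (weight 0.66) and specificity (weight 0.33), and the helper names and scores are hypothetical. With equal weights the maximizer coincides with the threshold that maximizes Youden’s J.

```python
def roc_points(y_true, y_score):
    """Return (threshold, tpr, fpr) for every distinct score used as a cut-off."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for c in sorted(set(y_score)):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= c and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= c and t == 0)
        points.append((c, tp / n_pos, fp / n_neg))
    return points


def best_threshold(y_true, y_score, w_sens=0.5, w_spec=0.5):
    """Threshold maximizing w_sens * sensitivity + w_spec * specificity."""
    return max(
        roc_points(y_true, y_score),
        key=lambda p: w_sens * p[1] + w_spec * (1.0 - p[2]),
    )[0]


# Toy data: 1 = pathological, 0 = non-pathological.
y_score = [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
y_true = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

print(best_threshold(y_true, y_score))              # 0.5 (Youden-optimal)
print(best_threshold(y_true, y_score, 0.66, 0.33))  # 0.2 (sensitivity-weighted)
```

On this toy data the sensitivity-weighted threshold is lower than the Youden-optimal one, so more cases are flagged as pathological and fewer cancers are missed, at the cost of more false positives.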

In conclusion, this model, with its good accuracy and the estimated clinically relevant threshold, would be well suited to reducing errors, especially false negatives, in the clinical setting.

Several data mining algorithms have been applied to cancer detection and classification (Shapiro et al. 1982; Bharati et al. 2019; Zhou et al. 2020; Celik et al. 2020; Benhammou et al. 2020; Kose and Alzubi 2020) using datasets in CSV format, but disease detection and classification from image datasets is a challenging task. Rather than classifying images into multiple categories such as benign, malignant (Hu et al. 2020; Bharati et al. 2018), and normal, we focus on binary classification. This is because classifying a case as normal with higher confidence is more clinically relevant and immediately applicable than multinomial classification. The strength of the paper is the balance between breadth and depth in its scope. Testing various transfer learning models enables us to identify the best model for the task. On the other hand, this approach has the shortcoming of requiring more time and computing resources. We have therefore tried to fine-tune our proposed models by trying different hyper-parameters and constructing our own network.

4.3 Comparison of accuracy with other works

In Rakhlin et al. (2018), the authors used a fusion of various deep CNN algorithms. They calculated sensitivity and AUC for two-class and four-class classification of breast cancer. Moreover, Kwok (2018) and Nawaz et al. (2018) proposed Inception-Resnet-v2 and ALEXNET, respectively, for the same dataset. The datasets used by Rakhlin et al. (2018), Kwok (2018), Vang et al. (2018), Sarmiento and Fondón (2018), and Nawaz et al. (2018) differ from ours. Therefore, direct comparisons are of limited value. Our dataset is the same as that used by Li et al. (2019), Wang et al. (2019), and Singh et al. (2020). Li et al. (2019) proposed the Dense U-Net algorithm, which is not a traditional algorithm like DenseNet or U-Net, and obtained an accuracy of 78.38%. Wang et al. (2019) and Singh et al. (2020) also proposed novel algorithms on the DDSM dataset. Their accuracies are lower than that of our proposed method: 86.50% and 80% are obtained for CNN-GTD and cGAN, respectively.

Our proposed method provides higher accuracy than the other transfer learning methods presented in Table 5 (Rakhlin et al. 2018; Kwok 2018; Vang et al. 2018; Sarmiento and Fondón 2018; Nawaz et al. 2018). It can also be seen from Table 6 that the architectures in Rakhlin et al. (2018), Kwok (2018), Vang et al. (2018), Nawaz et al. (2018), Sarmiento and Fondón (2018), Li et al. (2019), Wang et al. (2019) and Singh et al. (2020) provide an accuracy of 87.20%, 79.00%, 87.50%, 81.25%, 79.20%, 71%, 78.38%, 86.50%, and 80%, respectively, whereas our proposed model provides an accuracy of 94.3%.

Table 6 Comparative analysis with other existing works

5 Conclusion

The best performing architecture is an MVGG network pre-trained on ImageNet, with an accuracy of 94.3% and an AUC of 93.3%. The clinical analysis shows that recall (sensitivity) should be prioritized over specificity in the case of breast cancer. Thus, a clinical classification threshold is chosen that is much lower than the mathematically optimal threshold. This algorithm can help to considerably decrease the number of false negative mammogram diagnoses, which in turn can increase the chances of 5-year survival.

There are different approaches we would like to pursue in the future. Instead of binary classification, it would be interesting to perform a categorical classification based on the BI-RADS scores. In this case, however, masses and calcifications would have to remain merged. We would also need an additional, very detailed data set containing not only the BI-RADS scores but also other medical explanations.

Additionally, it will be interesting to integrate additional features into our algorithm. The tissue density of the woman plays a critical role in the breast cancer assessment. Obtaining this and adding it as a feature could potentially increase the accuracy of the algorithm significantly.