Abstract
Ductal carcinoma in situ (DCIS) is a pre-cancerous lesion in the ducts of the breast, and early diagnosis is crucial for optimal therapeutic intervention. Thermography imaging is a non-invasive imaging tool that can be utilized for detection of DCIS and although it has high accuracy (~ 88%), it is sensitivity can still be improved. Hence, we aimed to develop an automated artificial intelligence-based system for improved detection of DCIS in thermographs. This study proposed a novel artificial intelligence based system based on convolutional neural network (CNN) termed CNN-BDER on a multisource dataset containing 240 DCIS images and 240 healthy breast images. Based on CNN, batch normalization, dropout, exponential linear unit and rank-based weighted pooling were integrated, along with L-way data augmentation. Ten runs of tenfold cross validation were chosen to report the unbiased performances. Our proposed method achieved a sensitivity of 94.08 ± 1.22%, a specificity of 93.58 ± 1.49 and an accuracy of 93.83 ± 0.96. The proposed method gives superior performance than eight state-of-the-art approaches and manual diagnosis. The trained model could serve as a visual question answering system and improve diagnostic accuracy.
Similar content being viewed by others
Introduction
Ductal carcinoma in situ (DCIS), also named intra-ductal carcinoma is a pre-cancerous lesion of cells that line the breast milk ducts, but have not spread into the surrounding breast tissue. DCIS is considered the earliest stage of breast cancer (Stage 0) [1], and although cure rates are high the patients still need to be treated, since DCIS can become invasive. Note that there are four other stages: Stage 1 describes invasive breast cancer, the cancer cells of which are invading normal surrounding breast tissues. Stages 2 and 3 describe breast cancers that have invaded regional lymph nodes and Stage 4 represents metastatic cancer which spreads beyond the breast and regional lymph nodes to other distant organs [2]. Upon diagnosis of DCIS, treatment options include breast-conserving surgery (BCS), usually in combination with radiation therapy [3] or mastectomy.
Breast thermography (BT) is an alternative imaging tool to mammography, which is the traditional diagnostic tool for DCIS. Unlike mammography (which uses ionizing radiation to generate an image of the breast), BT utilizes infra-red (IR) images of skin temperature to assist in the diagnosis of numerous medical conditions, and has been suggested to detect breast cancer up to 10 years earlier than mammography [4]. Furthermore, due to its use of ionizing radiation, mammography can increase the risk of breast cancer by 2% with each scan [5].
Automatic interpretation of DCIS [6] by BT images consists of three phases: (1) segmentation of the region of interest, separating the breast from the image; (2) feature extraction, choosing distinguishing features that can help recognize the suspicious lesion; (3) classification, identifying the image as DCIS or healthy.
Previous studies have developed a number of effective artificial intelligence (AI) methods for DCIS detection using BT. Milosevic et al. [7] utilized 50 IR breast images to develop a co-occurrence matrix (COM) and run length matrix (RLM) as IR image descriptors. In the classification stage, a support vector machine (SVM) and naive Bayesian classifier (NBC) were used. Their methods are abbreviated as CRSVM and CRNBC. In addition, Nicandro et al. [8] employed NBC, whereas Chen [9] utilized wavelet energy entropy (WEE) as features to classify breast cancers with promising results. Zadeh et al. [10] combined self-organizing map and multilayer perceptron abbreviated as SMMP and Nguyen [11] introduced Hu moment invariant (HMI) to detect abnormal breasts. Finally, Muhammad [12] combined statistical measure and fractal dimension (SMFD), and Guo [13] proposed a wavelet energy support vector machine (WESVM) to detect breast cancer.
Nevertheless, the above methods require laborious feature engineering (FE), i.e., using domain knowledge to extract features from raw data. To help create an improved, automated AI model quickly and effectively, we proposed to use recent deep learning (DL) technologies, viz, convolutional neural networks (CNNs), which are a broad AI technique combining artificial intelligence and representation learning (RL).
Our contributions lie in four parts: (1) we proposed a novel 5-layer CNN; (2) we introduced exponential linear unit to replace traditional rectified linear unit; (3) we introduced rank-based weighted pooling to replace traditional pooling methods and (4) we used data augmentation to enhance the training set, so as to improve the test performance.
Background
Table 14 in “Appendix A” gives the abbreviations and their explanations for ease of reading.
Physical fundamentals
BT is a sub-science field within IR imaging sciences. IR cameras detect radiation in the long IR range (9–14 µm), with the thermal images generated being dubbed thermograms. Physically, Planck’s law stated the spectral of a body for frequency \(\omega\) at absolute temperature T is given as
where B stands for the spectral radiance, \(o\) the Planck constant, kB the Boltzmann constant, and \(l_{s}\) the light speed,. If replacing frequency \(\omega\) by wavelength λ using \(l_{s} = \lambda \omega\), above equation can be written as:
Both charge-coupled device (CCD) and complementary metal-oxide-semiconductor (CMOS) sensors in optical cameras detect visible light, and even near-infra-red (NIR) by utilizing parts of the IR spectrum. Basically, they could produce true thermograms with temperatures beyond 280 °C.
In our breast thermogram cases, the thermal imaging cameras have a range of 15–45 °C, and a sensitivity around 0.05 °C. Furthermore, three emitted components (ECs) help generate the following breast thermogram images: (1) EC of the breast, (2) EC of the surrounding medium, and (3) EC in the neighboring tissue.
Physiological fundamentals
In healthy tissue, the major regulation and control of dermal circulation is neurovascular, i.e. through the sympathetic nervous system. Its sympathetic response includes both adrenergic and cholinergic. The former causes vasoconstriction (VC, narrowing of blood vessels); conversely, the latter leads to vasodilation (VD, widening of blood vessels). The difference between VC and VD is presented in Fig. 1.
In the early stages of cancer growth, cancer cells produce nitric oxide (NO), resulting in VD. Tumor cells then initiate angiogenesis, which is necessary to sustain breast tumor growth. Both VD and angiogenesis lead to increased blood flow; therefore, the increased heat released as a result of increased blood flow to the tumor results in hotter areas than healthy skin.
The thermogram of a healthy person is symmetrical across the midline. Asymmetry in the thermogram might signify an abnormality, or even a tumor. Therefore, the thermogram illustrates the status of the breast and presence of breast diseases by identifying asymmetric temperature distribution.
Despite this, previous studies [7, 8, 14] have not measured asymmetry directly. As an alternative, those papers employed texture or statistical measures. As a result, this study did not use asymmetry information, and treated each side image (left breast or right breast) as individual images.
Dataset and preprocessing
240 DCIS breast images and 240 healthy breast (HB) images were obtained from 5 sources: (1) our previous study [12] and further collections after its publication. (2) Ann Arbor thermography [15]; (3) The Breast Thermography Image dataset [16]; (4) The Database for Mastology Research with Infrared Image [17] and (5) online resources using search engines including Google, Yahoo, etc.
Since our dataset is multi-source, we normalized all the collected images using preprocessing techniques. These included: (I) crop: remove background contents and only preserve the breast tissue and (II) resize: all images were re-sampled to the size of \(\left[ {128 \times 128 \times 3} \right]\). Suppose original image is \(x_{1} \left( t \right),t \in \left[ {1,480} \right]\). After Step I, we have
where \(\left( {l_{t} ,r_{t} ,t_{t} ,b_{t} } \right)\) are four parameters denotes left, right, top, and bottom margins of t-th image to be cropped.
Finally, after Step II, we have all the images \(x_{3} \left( t \right) \in X_{3}\)
Note some BT images used different pseudo colormaps (PCMs). For example, some used yellow to denote high temperature while some used red; conversely, some used blue to denote low temperature while some used green. We did not apply the same PCM to all BT images within our dataset for four reasons: (1) we expected our AI model would learn to determine a diagnosis based on color difference, not the color itself; (2) humans can make a diagnosis regardless of the PCM configuration, so we believed AI can do the same; (3) we expected our AI model can be universal, i.e., PCM-independent and (4) mixing of PCM color schemes in the training set can help make our AI model more robust when analyzing the test set, i.e., it does not require a particular PCM scheme.
Figure 2 shows a DCIS case, where we can clearly see the temperature difference of the lesion and the surrounding healthy tissues. All the images included in our dataset were checked by agreement of two professional radiologists \(\left( {R_{1} ,R_{2} } \right)\) with more than 10 years of experience. If their decisions \(\left[ {H\left( {R_{1} } \right),H\left( {R_{2} } \right)} \right]\) agreed, then the images were labelled correspondingly, otherwise, a senior radiologist \(\left( {R_{3} } \right)\) was consulted to achieve a consensus:
Here \(H\) is the labelling result, \(M\) denotes the majority voting, \(H\left[ {x_{3} \left( t \right),\left( {R_{1} ,R_{2} ,R_{3} } \right)} \right]\) denotes the labelling results by all three radiologists.
Methodology
Improvement 1: exponential linear unit
The activation function mimics the influence of an extra-cellular field on a brain axon/neuron. The real activation function for an axon is quite complicated, and can be written as
where n means the index of axon’s compartment model, c the membrane capacity, \(R_{n}\) the axonal resistance of compartment n, \(V_{n}^{e}\) the extra-cellular voltage outside compartment n relative to the ground [18]. This is difficult to determine in an “artificial neural network”, and thus AI scientists designed some simplistic and ideal activation functions (AFs), which have no direct connection with the axon’s activating function, but those AFs work well for ANNs [19].
An important property of AF is nonlinearity. The reason is stacks of linear function will also be linear, and those kinds of linear AFs can only solve trivial problems and cannot make decisions. Only nonlinear AF can allow neural networks to solve non-trivial problems, such as decision-making. Similar ideas were mentioned as “even our mind is governed by the nonlinear dynamics of complex systems” by Mainzer [20].
Suppose the input is t, traditional rectified linear unit (ReLU) [21] \(f_{{{\text{ReLU}}}}\) is defined as
with its derivative as
When \(t < 0\), the activation of \(f_{{{\text{ReLU}}}}\) values are set to zero, so ReLU cannot train the networks via gradient-based learning. Clevert et al. [22] proposed the exponential linear unit (ELU)
ELU’s derivative is
The default value of \(\gamma = 1\). Figure 3 represents the shapes of five different but common AFs. Each subplot has the same range on the x-axis and y-axis for easy comparison. Information regarding the three AFs (Sigmoid, HT, and LReLU) can be found in “Appendix B”.
Improvement 2: rank-based weighted pooling
The activation maps (AMs) after conv layer are usually too large, i.e., the size of their width, length, and channels are too large to handle, which will cause (1) overfitting of the training set and (2) large computational costs. Instead pooling layer (PL) is a form of nonlinear downsampling (NLDS) used to solve the above issue. Further, PL can provide invariance-to-translation properties to the AMs.
For a \(2 \times 2\) region, suppose the pixels within the region \(\Phi = \left\{ {\varphi_{ij} } \right\},\left( {i = 1,2,j = 1,2} \right)\) are
Strided convolution (SC) can be regarded as a convolution followed by a special pooling. If the stride is set to 2, the output of SC is:
The shortcoming of SC is that it will miss stronger activations if \(\varphi_{1,1}\) is not the strongest activation. The advantage of SC is the convolution layer only needs to calculate 1/4 of all outputs in this case, so it can save computation.
L2P calculates the \(l_{2}\) norm [23] of a given region \(\Phi\). Assume the output value after NLDS is y, L2P output \(y_{\Phi }^{{{\text{L}}2{\text{P}}}}\) is defined as \(y_{\Phi }^{{{\text{L}}2{\text{P}}}} = {\text{sqrt}}\left( {\mathop \sum \nolimits_{i,j = 1}^{2} \phi_{ij}^{2} } \right)\). In this study, we add a constant \(1/\left| \Phi \right|\), where \(\left| \Phi \right|\) means the number of elements of region \(\Phi\). Here \(\left| \Phi \right| = 4\) if we use a \(2 \times 2{ }\) NLDS pooling. This added new constant 1/4 does not influence training and inference.
The average pooling (AP) calculates the mean value in the region \(\Phi\) as
The max pooling (MP) operates on the region \(\Phi\) and selects the max value. Note that L2P, AP and MP work on every slice separately.
Rank-based weighted pooling (RWP) was introduced to overcome the down-weight (DW), overfitting, and lack of generation (LG) caused by the above pooling methods (L2P, AP, and MP). Instead of computing the \(l_{2}\) norm, average, or the max, the output of the RWP \(y_{\Phi }^{{{\text{RWP}}}}\) is calculated based on the rank matrix.
First, rank matrix (RM) \(R = \left\{ {r_{m} } \right\}\) is calculated based on the values of each element \(\varphi_{m} \in \Phi\), usually lower ranks \(r \in R\) are assigned to higher values (\(\varphi\)) as
In case of tied values (\(\varphi_{m1} = \varphi_{m2}\)), a constraint is added as
Second, (ER) map \(E = \left\{ {e_{m} } \right\}\) is defined as
where α is a hyper-parameter. \(\alpha = 0.5\) for all RWP layers, so we do not need to tune \(\alpha\) in this study. Equation (18) can be updated as
Third, RWP [24] is defined as the summation of \(\varphi_{ij}\) and \(e_{ij}\) as below
Figure 7 in “Appendix C” gives a schematic comparison of L2P, AP, MP, and RWP.
For better understanding, a pseudocode of RWP is presented in Table 1. We suppose there is an activation map \(X_{{{\text{AM}}}}\) with size of \(\left[ {R,C} \right]\), where \(R\) means the number of rows, and \(C\) means the number of columns. Note row index is set to \(r\) and column index \(c\). The RWP output of \(X_{{{\text{AM}}}}\) is symbolized as \(y^{{{\text{RWP}}}}\) with size of \(\left[ {\frac{R}{2},\frac{C}{2}} \right]\). Table 2 itemizes the equations of every pooling methods.
Improvement 3: L-way data augmentation
Traditional data augmentation is a strategy that enables AI practitioners to radically increase the diversity of training data, without collecting new data actually. In this study, we proposed a L-way data augmentation (LDA) technology to further increase the diversity of the training data. The whole preprocessed image set \(X_{3}\), from Eq. (4), will separate into \(K\) folds:
where \(k\) represents the fold index.
At k-th trial, fold k will be used as the test set \(D_{k}\), and other folds will be used as the training set \(C_{k}\):
If we do not consider the index k, and just simplify the situations as \(X_{3} \to \left\{ {C,D} \right\}\), for each training image \(c\left( k \right) \in C,k = 1, \ldots ,\left| C \right|\), we will do the following eight DA techniques. Here we suppose each DA technique will generate \(W\) new images.
-
(1)
Gamma correction (GC). The equations are defined as:
$$ \begin{aligned} \overrightarrow {{c^{1} \left( k \right)}} & = {\text{GC}}\left[ {c\left( k \right)} \right] \\ & = \left[ {c_{1}^{{{\text{GC}}}} \left( {k,\eta _{1}^{{{\text{GC}}}} } \right), \ldots c_{W}^{{{\text{GC}}}} \left( {k,\eta _{W}^{{{\text{GC}}}} } \right)} \right], \\ \end{aligned} $$(23)where \(\eta_{j}^{{{\text{GC}}}} \left( {j = 1, \ldots ,W} \right)\) are GC factors.
-
(2)
Rotation. Rotation operation rotates the original image to produce W new images [25]:
$$ \begin{aligned} \overrightarrow {{c^{2} \left( k \right)}} &= {\text{RO}}\left[ {c\left( k \right)} \right] \hfill \\ &= \left[ {c_{1}^{{{\text{RO}}}} \left( {k,\eta_{1}^{{{\text{RO}}}} } \right), \ldots c_{W}^{{{\text{RO}}}} \left( {k,\eta_{W}^{{{\text{RO}}}} } \right)} \right] \hfill \\ \end{aligned} $$(24)where \(\eta_{j}^{{{\text{RO}}}} \left( {j = 1, \ldots ,W} \right)\) are rotation factors.
-
(3)
Scaling. All training images \(c\left( k \right)\) were scaled [25] as
$$ \begin{aligned} \overrightarrow {{c^{3} \left( k \right)}} &= {\text{SC}}\left[ {c\left( k \right)} \right] \hfill \\ &= \left[ {c_{1}^{{{\text{SC}}}} \left( {k,\eta_{1}^{{{\text{SC}}}} } \right), \ldots c_{W}^{{{\text{SC}}}} \left( {k,\eta_{W}^{{{\text{SC}}}} } \right)} \right], \hfill \\ \end{aligned} $$(25)where \(\eta_{j}^{{{\text{SC}}}} \left( {j = 1, \ldots ,W} \right)\) are scaling factors.
-
(4)
Horizontal shear (HS) transform. \(W\) new images were generated by HS transform
$$ \begin{aligned} \overrightarrow {{c^{4} \left( k \right)}} &= {\text{HS}}\left[ {c\left( k \right)} \right] \hfill \\ &= \left[ {c_{1}^{{{\text{HS}}}} \left( {k,\eta_{1}^{{{\text{HS}}}} } \right), \ldots c_{W}^{{{\text{HS}}}} \left( {k,\eta_{W}^{{{\text{HS}}}} } \right)} \right], \hfill \\ \end{aligned} $$(26)where \(\eta_{j}^{{{\text{HS}}}} \left( {j = 1, \ldots ,W} \right)\) are HS factors.
-
(5)
Vertical shear (VS) transform. VS transform was generated similarly to HS transform
$$ \begin{aligned} \overrightarrow {{c^{5} \left( k \right)}} & = {\text{VS}}\left[ {c\left( k \right)} \right] \\ & = \left[ {c_{1}^{{{\text{VS}}}} \left( {k,\eta_{1}^{{{\text{VS}}}} } \right), \ldots c_{W}^{{{\text{VS}}}} \left( {k,\eta_{W}^{{{\text{VS}}}} } \right)} \right], \\ \end{aligned} $$(27a)$$ \eta_{m}^{{{\text{VS}}}} = \eta_{m}^{{{\text{HS}}}} ,\forall m \in 1,2, \ldots ,W. $$(27b) -
(6)
Random translation (RT). All training images \(c\left( k \right)\) were translated \(W\) times with random horizontal shift \(\varepsilon^{x}\) and random vertical shift \(\varepsilon^{y}\), both values of which are in the range of \(\left[ { - \Delta ,\Delta } \right]\), and obey uniform distribution \({\mathcal{U}}\):
$$ \begin{aligned} \overrightarrow {{c^{6} \left( k \right)}}& = {\text{RT}}\left[ {c\left( k \right)} \right] \hfill \\& = \left[ {c_{1}^{{{\text{RT}}}} \left( {k,\varepsilon_{1}^{x} ,\varepsilon_{1}^{y} } \right), \ldots c_{W}^{{{\text{RT}}}} \left( {k,\varepsilon_{W}^{x} ,\varepsilon_{W}^{y} } \right)} \right], \hfill \\ \end{aligned} $$(28)where
$$ \varepsilon_{m}^{x} \sim {\mathcal{U}}\left[ { - \Delta ,\Delta } \right],\forall m \in \left[ {1,W} \right] $$(29a)$$ \varepsilon_{m}^{y} \sim {\mathcal{U}}\left[ { - \Delta ,\Delta } \right],\forall m \in \left[ {1,W} \right], $$(29b)where \(\Delta\) is the maximum shift factor.
-
(7)
Color jittering (CJ). CJ shifts the color values in original images [26] by adding or subtracting a random value. The advantage of CJ is it can help bring in randomness change to the color channels, so it can aid production of fake color images:
$$ \begin{aligned} \overrightarrow {{c^{7} \left( k \right)}} &= {\text{CJ}}\left[ {c\left( k \right)} \right] \hfill \\ &= \left[ {c_{1}^{{{\text{CJ}}}} \left( {k,\xi_{1}^{r} ,\xi_{1}^{g} ,\xi_{1}^{b} } \right), \ldots c_{W}^{{{\text{CJ}}}} \left( {k,\xi_{W}^{r} ,\xi_{W}^{g} ,\xi_{W}^{b} } \right)} \right]. \hfill \\ \end{aligned} $$(30)The shifted color random values are within the range of \(\left[ { - \varpi , + \varpi } \right]\), as
$$ \xi_{m}^{{{\text{CC}}}} \sim {\mathcal{U}}\left[ { - \varpi ,\varpi } \right] $$(31a)$$ \forall m \in \left[ {1,W} \right] \wedge \forall {\text{CC}} \in \left\{ {r,g,b} \right\}, $$(31b)where CC means color channel. \(\varpi\) means maximum color shift value.
-
(8)
Noise injection. The 0-mean 0.01-variance Gaussian noises [27] were added to all training images to produce \(W\) new noised images:
$$ \begin{aligned} \overrightarrow {{c^{{L/2}} \left( k \right)}} & = {\text{NO}}\left[ {a\left( k \right)} \right] \\ & = \left[ {c_{1}^{{{\text{NO}}}} \left( k \right), \ldots c_{W}^{{{\text{NO}}}} \left( k \right)} \right], \\ \end{aligned} $$(32)where NO denotes the noise injection operation.
-
(9)
Mirror and concatenation. All the above \(L/2\) results are mirrored, we have
$$ \begin{array}{*{20}c} {\overrightarrow {{c^{L/2 + 1} \left( k \right)}} = M\left( {\overrightarrow {{c^{1} \left( k \right)}} } \right)} \\ \end{array} $$(33a)$$ \begin{array}{*{20}c} {\overrightarrow {{c^{L/2 + 2} \left( k \right)}} = M\left( {\overrightarrow {{c^{2} \left( k \right)}} } \right)} \\ \end{array} $$(33b)$$ \cdots $$$$ \begin{array}{*{20}c} {\overrightarrow {{c^{L} \left( k \right)}} = M\left( {\overrightarrow {{c^{L/2} \left( k \right)}} } \right)} \\ \end{array} . $$(33c)where M represents the mirror function. All the results are finally concatenated as
$$ \underbrace {{\overrightarrow {{c^{{{\text{LDA}}}} \left( k \right)}} }}_{L \times W + 1} = {\text{concat}}\left\{ {\underbrace {c\left( k \right)}_{1},\underbrace {{\overrightarrow {{c^{1} \left( k \right)}} }}_{W}, \ldots ,\underbrace {{\overrightarrow {{c^{L} \left( k \right)}} }}_{W}} \right\}. $$(34)The size of \(\overrightarrow {{c^{{{\text{LDA}}}} \left( k \right)}}\) is \(L \times W + 1\) images. Thus, the LDA can be regarded as a function \(c\left( k \right) \mapsto \overrightarrow {{c^{{{\text{LDA}}}} \left( k \right)}}\).
Proposed models and algorithm
We proposed five models in total in this study. Table 3 presents their relationships. Model-0 was the base CNN model with \(N_{{{\text{CL}}}}\) conv layers and \(N_{{{\text{FCL}}}}\) fully connected layers. In Model-0, we used max pooling (MP) and ReLU activation function. Model-1 combined Model-0 with batch normalization (BN) and dropout (DO). Model-2 used ELU to replace ReLU in Model-1, while Model-3 used RWP to replace MP in Model-1. Finally, Model-4 introduced both ELU and RWP to enhance the performance based on Model-1.
The top row of Fig. 4a shows the activation maps of the proposed Model-0. Here the size of input was \(S_{0} = 128 \times 128 \times 3\), the first conv block is composed of one conv layer, one activation function layer, and one pooling layer. After conv layer, \(S_{1} = 128 \times 128 \times 32\). Then after the activation function layer, the output is the same as \(S_{1}\). After the pooling layer, the size is \(S_{2} = 64 \times 64 \times 32\). The conv block then repeats three times, we have \(S_{3} = 64 \times 64 \times 64\) and \(S_{4} = 32 \times 32 \times 64\) for the second conv block, \(S_{5} = 32 \times 32 \times 128\), and \(S_{6} = 16 \times 16 \times 128\) for the third conv block, \(S_{7} = 16 \times 16 \times 256\) and \(S_{8} = 8 \times 8 \times 256\) for the four conv block. Then \(S_{8}\) was flattened and passed through the first fully connected layer with output as \(S_{9} = 1 \times 1 \times 50\). The output of the second fully connected layer was \(S_{10} = 1 \times 1 \times 2\).
Measures
The randomness effect of each run reduced performance reliability, so we used \(K\)-fold cross validation to analyze unbiased performances. The size of each fold is \(\left| {X_{3} } \right|/K\). Due to there being two balanced classes (DCIS and HB), each class will have \(\left| {X_{3} } \right|/\left( {2 \times K} \right)\) images. The split setting of one trial is shown in Table 6. Within each trial, \(\left( {K - 1} \right)\) folds were used as training, and the rest fold were used as test. After combining all \(K\) trials, the test image grew to \(\left| {X_{3} } \right|\). If above \(K\)-fold cross validation repeats \(Z\) runs, the performance will be reported on \(\left| {X_{3} } \right| \times Z\) images.
Suppose the ideal confusion matrix \(E^{{{\text{ideal}}}}\) over the test set at k-th trial and z-th run is
where the constant 2 is because our dataset is a balanced, i.e., DCIS class has the same size of HB. After combining \(K\) trials, the ideal confusion matrix is at \(z\)-th run is
In realistic inference, we cannot get the perfect diagonal matrix as shown in Eq. (36), suppose the z-th run real confusion matrix is
where \(0 \le a,b,c,d \le \left| {X_{3} } \right|/2\). The four variables \(\left( {a,b,c,d} \right)\) represent TP, FN, FP, and TN, respectively. Here P means DCIS and N means healthy breast (HB).
Four simple measures \(\left( {\nu^{1} ,\nu^{2} ,\nu^{3} ,\nu^{4} } \right)\) can be defined as
where \(\left[ {\nu^{1} \left( z \right),\nu^{2} \left( z \right),\nu^{3} \left( z \right),\nu^{4} \left( z \right)} \right]\) means sensitivity, specificity, precision, and accuracy at z-th run, respectively. Besides, F1 score \(\nu^{5} \left( z \right)\), Matthews correlation coefficient (MCC) \(\nu^{6} \left( z \right)\), and Fowlkes–Mallows index (FMI) \(\nu^{7} \left( z \right)\) can be defined as:
After averaging \(Z\) runs, we can calculate the mean \(\left( M \right)\) and standard deviation (SD) of all k-th \(\left( {\forall k \in \left[ {1,7} \right]} \right)\) measures as
The result is reported in the format of \(M \pm {\text{SD}}\). For ease of typing, we write it in short as MSD.
Experiments and results
Parameter setting
Table 4 shows the parameter setting of variables in this study. The values were obtained using trial-and-error. The total size of our dataset was 480, and thus the size of the preprocessed image set is \(\left| {X_{3} } \right| = 480\). The number of folds and runs were all set to 10, i.e., \(K = 10, Z = 10\). Then, each fold contained 48 images, that is 24 DCIS and 24 HB images. The training set contained \(\left| C \right| = 432\) images, and the test set contained \(\left| D \right| = 48\) images. The number of DA ways was \(L = 16\), the number of new images for each DA technique was \(W = 30\). Thus, we created \(L \times W = 480\) new images for every training image. The number of conv layers/blocks was \(N_{{{\text{CL}}}} = 4\), and the number of fully connected layers/blocks was \(N_{{{\text{FCL}}}} = 2\).
Table 5 itemizes the LDA parameter settings. The GC factors \(\eta^{{{\text{GC}}}}\) varied from 0.4 to 1.6 with an increase of 0.04, skipping the value of 1. The rotation vector \(\eta^{{{\text{RO}}}}\) was in the value from \(- W\) to \(W\) an increase of 2°, skipping \(\eta^{{{\text{RO}}}} = 0\). Scaling factor \(\eta^{{{\text{SC}}}}\) varied from 0.7 to 1.3 with an increase of 0.02, skipping \(\eta^{{{\text{SC}}}} = 1\). HS factors \(\eta^{{{\text{HS}}}}\) varied from − 0.15 to 0.15 with an increase of 0.01, skipping the value of \(\eta^{{{\text{HS}}}} = 0\). The maximum shift factor \(\Delta = 15\). The maximum color shift value was \(\varpi = 50\).
Table 6 shows the K-fold cross validation setting, which was used in the experiment to report unbiased performances [28]. For each trial, the training image set contained 216 DCIS and 216 HB images. Then after L-way data augmentation, the LDA training set contained 103,896 images for each class, and thus together \(\left| {{\text{DA}}\left( C \right)} \right| = 207,792\) images. The size of the test set during each trial was only 48 images. Combining 10 trials, the final combined test set is the same as the original dataset of 480 images.
Statistical result of proposed model-4
The ten runs of our Model-4 results are shown in Table 7. Here it shows using our Model-4 CNN-BDER yielded \(\nu^{1} = 94.08 \pm { }1.22\), \(\nu^{2} = 93.58 \pm { }1.49\), \(\nu^{3} = 93.63 \pm { }1.37\), \(\nu^{4} = 93.83 \pm { }0.96\), \(\nu^{5} = 93.85 \pm { }0.94\), \(\nu^{6} = 87.68 \pm { }1.91\), \(\nu^{7} = 93.85 \pm { }0.94\). In summary, our model-4 showed high accuracy, potentially aiding radiologists to make fast and accurate decisions.
Model comparison
We next compared the Model-4 CNN-BDER result with other four models (Model-0 BCNN, Model-1 CNN-BD, Model-2 CNN-BDE, and Model-3 CNN-BDR). The comparison results are shown in Table 8. Here, Model-4 CNN-BDER yielded the best results among all five models. Note that \(\nu^{2}\) and \(\nu^{3}\) of Model-3 CNN-BDR are quite close to those of Model-4 CNN-BDER, but considering the results were obtained using an average of ten runs, we can still conclude that Model-4 CNN-BDER has higher accuracy than Model-3 CNN-BDR in terms of all seven indicators.
Kruskal–Wallis test was preformed based on Model-4 against Model-(m), where \(m = 0,1,2,3\). The p value result matrix \(P\) is listed in Table 9. The null hypothesis is the indicator vector \(\nu^{n} \left( {n = 1, \ldots ,7} \right)\) of \(Z\) runs of Model-(m) and that of Model-4 come from the same distribution, and the alternative hypothesis that not all samples are obtained from the same distribution. Then we recorded the corresponding p value as \(p\left( {m,n} \right)\). The final matrix \(P = \left[ {p\left( {m,n} \right)} \right],m = 0, \ldots ,3,n = 1, \ldots ,7\). Note here we chose \(Z = 30\). The reason is our data are not normally distributed (see Table 7), so it is important to obtain a larger sample set.
The first row and second row of Table 9 show that all p values are < 0.05. So, the test rejects the null hypothesis at the 5% significance level, indicating that Model-4 is significantly better than Model-0 and Model-1 for all seven indicators. For the third row, the p values show that Model-4 is significantly better than Model-2 for all indicators other than sensitivity \(\nu^{1}\). For the last row, the p values show that Model-4 is significantly better than Model-3 for all indicators other than specificity \(\nu^{2}\) and precision \(\nu^{3}\).
Effect of LDA
Table 10 presents the results of not using LDA, showing decreased accuracy compared to those using LDA and highlights the effectiveness of our proposed LDA. The future research direction is to explore more types of DA techniques and increase the diversity of LDA, hence, improving the generalization ability of our AI models. Note that Model-0 BCNN and Model-1 CNN-BD without LDA obtain performances lower than 90%, which are worse than traditional AI methods that do not utilize deep learning. This means deep learning with big data can improve performance, if we do not have big data (not using data augmentation means our training set is only 432 images as shown in Table 6), then deep models may not compete with traditional shallow models.
Figure 5 summarizes and compares all ten models, where LDA and NLDA represent use and non-use of LDA, respectively. From Fig. 5 we can clearly observe that our Model-4 CNN-BDER using LDA can obtain the best performance among all six models.
Here we do not run hypothesis test, since all the models without LDA show reduced performance than the models with LDA. We have already proven that the statement “Model-4 is better than Models-(0–3)” is statistically significant, so we can conclude that Model-4 is better than Models without LDA.
Comparison to state-of-the-art approaches
Our proposed Model-4 CNN-BDER was compared with state-of-the-art approaches. First, we used the 40-image dataset in Ref. [12]. The comparison results are presented in Table 11. Note here the performance of our Model-4 differs from previous experiments, because we analyzed a smaller dataset (40-images). The reason why our method is better than SMFD [12] is because SMFD, i.e., statistical measure and fractal dimension, can help extract statistical and global texture information, but it is inefficient in extracting local information.
Next, we compared our Model-4 with recent state-of-the-art algorithms on the entire 480-image dataset using 10 runs of tenfold cross validation. The comparison algorithms include NBC [8], CRNBC [7], CRSVM [7], WEE [9], SMMP [10], HMI [11], SMFD [12], WESVM [13]. The comparative results are shown in Table 12. Here Ref. [7] provided two methods, one using naive Bayesian classifier, and the other using support vector machine.
The results in Table 12 showed that our Model-4 CNN-BDER method performed better than eight state-of-the-art approaches. Except MCC \(\nu^{6}\), the other six indicators of our method are greater than 93%. While, the second best method is SMFD [12], whose seven indicator values are all less than 91%. SMFD [12] can help extract statistical and global texture information, but it is inefficient when extracting local information. WEE [9] has a similar problem, since wavelet energy entropy uses wavelet to extract multi-resolution information, and a higher decomposition level of wavelet can extract finer-resolution. But it is difficult to run high-level decomposition in practice. Hence, the information from WEE [9] is mostly at a coarse level. CRNBC [7] and CRSVM [7] used co-occurrence matrix (COM) and run length matrix (RLM) as the feature extraction method, and employed naive Bayesian classifier (NBC) and support vector machine (SVM) as classifiers. COM computes the distribution of co-occurring pixel values at given offsets, while RLM computes the size of homogeneous runs for each grey level. Both features are easy to implement for computer scientists, but their capability of distinguishing tumors from surrounding healthy issues needs to be verified. Also, NBC and SVM are traditional classifiers, whose performances are not as high compared to recent deep learning approaches. SMMP [10] combined self-organizing map (SOM) and multilayer perceptron (MLP) methods. SOM used unsupervised learning to generate a low-dimensional discretized representation of the input space from the training image samples, while MLP has only one hidden layer that may limit its expressivity power. WESVM [13] used wavelet energy support vector machine as the classifier. However, wavelet energy is not a popular feature descriptor, whose improvements and modifications on wavelet energy are still in progress. The two worst methods are NBC [8] and HMI [11]. The former assumes the presence/absence of a feature of a class is unrelated to the presence/absence of any other features; however, this assumption is difficult to fulfil in practice. The latter employed seven Hu moment invariants as feature descriptors, which may be insufficient to capture information regarding breast cancer masses. The performance can be improved by combining with other feature descriptors. In all, Table 12 shows the improved performance of our Model-4 CNN-BDER method.
Comparison to manual diagnosis
Three experienced radiologists \(\left( {P_{1} ,P_{2} ,P_{3} } \right)\) were invited to independently inspect our dataset of 480 thermogram images. None of the radiologists had observed any of the images in advance.
The results of three radiologists are itemized in Table 13. The first radiologist \(\left( {P_{1} } \right)\) obtained a sensitivity of 71.67%, a specificity of 74.17%, a precision of 73.50%, and an accuracy of 72.92%. The second radiologist \(\left( {P_{2} } \right)\) obtained the four indicators as 81.25%, 73.75%, 75.58%, and 77.50%, respectively. The third radiologist \(P_{3}\) obtained the four measures as 75.42%, 82.50%, 81.17%, and 78.96%, respectively. Comparing Table 13 with our method Model-4, which is also illustrated in Fig. 6, from which we can see that our proposed CNN-BDER method can give higher performance than manual diagnosis. The reason may be DCIS is Stage 0 of breast cancer, so some lesions are difficult to discern by radiologists while AI can potentially capture those slight and minor lesions.
Conclusions
We built a new DCIS detection system based on breast thermal images. The method CNN-BDER is based on convolutional neural network, and CNN-BDER has three contributions: (1) use of exponential linear unit to replace traditional ReLU function; (2) use of rank-based weighted pooling to replace traditional max pooling and (3) A L-way data augmentation was proposed.
The results show that our Model-4 CNN-BDER method can achieve \(\nu^{1} = 94.08 \pm { }1.22\), \(\nu^{2} = 93.58 \pm { }1.49\), \(\nu^{3} = 93.63 \pm { }1.37\), \(\nu^{4} = 93.83 \pm { }0.96\), \(\nu^{5} = 93.85 \pm { }0.94\), \(\nu^{6} = 87.68 \pm { }1.91\), \(\nu^{7} = 93.85 \pm { }0.94\). Our Model-4 offers improved performance over not only the other four proposed models (Model-0, Model-1, Model-2, and Model-3) validated by Kruskal–Wallis test, but also eight state-of-the-art approaches.
The shortcomings of our proposed Model-4 are threefold: (1) the model has not been verified clinically, but will certainly form the basis of future studies; (2) the model does not work with mammogram images, so we will aim to develop a hybrid model in the future which can help give predictive results regardless of whether the input is a thermogram image, a mammogram image or both.
The future direction will be following aspects: (1) try to expand the dataset and introduce more thermal images; (2) move our AI system online and allow radiologists worldwide to test our algorithm and (3) test other advanced AI algorithms.
References
Weedon-Fekjaer H, Li XX, Lee S (2020) Estimating the natural progression of non-invasive ductal carcinoma in situ breast cancer lesions using screening data. JS Med Screen p 9, Article ID: 0969141320945736
Yoon GY, Choi WJ, Cha JH, Shin HJ, Chae EY, Kim HH (2020) The role of MRI and clinicopathologic features in predicting the invasive component of biopsy-confirmed ductal carcinoma in situ. BMC Med Imaging 20:11
Racz JM, Glasgow AE, Keeney GL, Degnim AC, Hieken TJ, Jakub JW et al (2020) Intraoperative pathologic margin analysis and re-excision to minimize reoperation for patients undergoing breast-conserving surgery. Ann Surg Oncol 27(13):5303–5311
Ng EYK (2009) A review of thermography as promising non-invasive detection modality for breast tumor. Int J Therm Sci 48:849–859
Borchartt TB, Conci A, Lima RCF, Resmini R, Sanchez A (2013) Breast thermography from an image processing viewpoint: a survey. Signal Process 93:2785–2803
Miligy IM, Toss MS, Shiino S, Oni G, Syed BM, Khout H et al (2020) The clinical significance of estrogen receptor expression in breast ductal carcinoma in situ. Br J Cancer. https://doi.org/10.1038/s41416-020-1023-3
Milosevic M, Jankovic D, Peulic A (2015) Comparative analysis of breast cancer detection in mammograms and thermograms. Biomed Eng Biomed Tech 60:49–56
Nicandro CR, Efren MM, Yaneli AAM, Enrique MDM, Gabriel AMH, Nancy PC et al (2013) Evaluation of the diagnostic power of thermography in breast cancer using Bayesian network classifiers. In: Computational and mathematical methods in medicine, Article ID: Unsp 264246, 2013
Chen Y (2018) Wavelet energy entropy and linear regression classifier for detecting abnormal breasts. Multimed Tools Appl 77:3813–3832
Zadeh HG, Montazeri A, Kazerouni IA, Haddadnia J (2017) Clustering and screening for breast cancer on thermal images using a combination of SOM and MLP. Comput Methods Biomech Biomed Eng Imaging Vis 5:68–76
Nguyen E (2018) Breast cancer detection via Hu moment invariant and feedforward neural network. In: AIP conference proceedings, vol 1954, Article ID: 030014, 2018
Muhammad K (2017) Ductal carcinoma in situ detection in breast thermography by extreme learning machine and combination of statistical measure and fractal dimension. J Ambient Intell Humaniz Comput. https://doi.org/10.1007/s12652-017-0639-5
Guo Z-W (2018) Breast cancer detection via wavelet energy and support vector machine. In: 27th IEEE international conference on robot and human interactive communication (ROMAN), Nanjing, China, 2018, pp 758–763
Milosevic M, Jankovic D, Peulic A (2014) Thermography based breast cancer detection using texture features and minimum variance quantization. EXCLI J 13:1204–1215
Ann Arbor Thermography. https://aathermography.com/breast/breasthtml/breasthtml.html
Breast Thermography Image dataset (2020). https://www.dropbox.com/s/c7gfp2bo1ae466m/database.zip?dl=0
Silva LF, Saade DCM, Sequeiros GO, Silva AC, Paiva AC, Bravo RS et al (2014) A new database for breast research with infrared image. J Med Imaging Health Inform 4:92–100
Rattay F (1998) Analysis of the electrical excitation of CNS neurons. IEEE Trans Biomed Eng 45:766–772
Górriz JM (2020) Artificial intelligence within the interplay between natural and artificial computation: advances in data science, trends and applications. Neurocomputing 410:237–270
Mainze K (1997) Introduction: from linear to nonlinear thinking. In Thinking in complexity. Springer, Berlin, pp 1–2
Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: 27th International conference on machine learning (ICML), Haifa, Israel, 2010, pp 807–814
Clevert D-A, Unterthiner T, Hochreiter S (2016) Fast and accurate deep network learning by exponential linear units (ELUs). arXiv. arXiv:1511.07289v5
Rezaei M, Yang H, Meinel C (2017) Deep neural network with l2-norm unit for brain lesions detection. In: International conference on neural information processing (ICNIP), Cham, 2017, pp 798–807
Jiang YY (2017) Cerebral micro-bleed detection based on the convolution neural network with rank based average pooling. IEEE Access 5:16576–16583
Blok PM, van Evert FK, Tielen APM, van Henten EJ, Kootstra G (2020) The effect of data augmentation and network simplification on the image-based detection of broccoli heads with Mask R-CNN. J Field Robot. https://doi.org/10.1002/rob.21975
Puttaruksa C, Taeprasartsit P (2018) Color data augmentation through learning color-mapping parameters between cameras. In: 15th International joint conference on computer science and software engineering, Mahidol University, Facility ICT, Thailand, 2018, pp 6–11
Pandian JA, Geetharamani G, Annette B, Ieee (2019) Data augmentation on plant leaf disease image dataset using image manipulation and deep learning techniques. In: 9th International conference on advanced computing, MAM College of Engineering and Technology, Tiruchirapalli, India, 2019, pp 199–204
Marcot BG, Hanea AM (2020) What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis? Comput Stat. https://doi.org/10.1007/s00180-020-00999-9
Acknowledgements
The paper is partially supported by British Heart Foundation Accelerator Award, UK; Royal Society International Exchanges Cost Share Award, UK (RP202G0230); Hope Foundation for Cancer Research, UK (RM60G0680); Medical Research Council Confidence in Concept Award, UK (MC_PC_17171); MINECO/ FEDER (RTI2018-098913-B100, A-TIC-080-UGR18), Spain/Europe.
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A
See Table 14.
Appendix B
Suppose the input is t, traditional AF is in the form of sigmoid function \(f_{{{\text{sig}}}}\), defined as
with its derivative as
Sigmoid output is in the range of \(\left[ {0,1} \right]\). In some situations, the range \(\left[ { - 1,1} \right]\) is expected. \(f_{{{\text{sig}}}} \left( t \right)\) could be shifted to become the hyperbolic tangent (HT) function
with its derivative as
Nonetheless, the widespread saturation of \(f_{{{\text{sig}}}}\) and hyperbolic tangent function \(f_{{{\text{HT}}}}\) make gradient-based learning (GBL) and its variants perform poorly in the neural network training phase. Hence, rectified linear unit (ReLU) \(f_{{{\text{ReLU}}}}\) has grown in popularity, because it accelerates the convergence of GBL compared to \(f_{{{\text{sig}}}}\) and \(f_{{{\text{HT}}}}\).s
When \(t < 0\), the activation of \(f_{{{\text{ReLU}}}}\) values are zero, so ReLU cannot learn via GBLs, because the gradients are all zero. The leaky ReLU (LReLU) could ease this problem caused by changing hard-zero activation of ReLU. LReLU’s function \(f_{{{\text{LReLU}}}} \left( t \right)\) is defined as
where parameter \(\beta = 0.01\) is the commonly pre-assigned value. Its derivative is defined as
Appendix C
Using Fig. 7 as an example, and assuming the region \({\Phi }\left( {1,1} \right)\) at 1st row 1st column of the input AM I is chosen as \(\Phi \left( {1,1} \right) = I\left( {{\text{row}} = 1,{\text{col}} = 1} \right)\), the row vector of \({\Phi }\left( {1,1} \right)\) is \({\text{vec}}\left[ {{\Phi }\left( {1,1} \right)} \right] = \left( {\begin{array}{*{20}c} 8 & {0.2} & {2.2} & {1.1} \\ \end{array} } \right)\). We can calculate the results of L2P is: \(y_{{\Phi \left( {1,1} \right)}}^{{{\text{L}}2{\text{P}}}} = {\text{sqrt}}\left( {\left( {8^{2} + 0.2^{2} + 2.2^{2} + 1.1^{2} } \right)/4} \right) = {\text{sqrt}}\left( {\left( {64 + 0.04 + 4.84 + 1.21} \right)/4} \right) = 4.19\). The AP result is: \(y_{{\Phi \left( {1,1} \right)}}^{{{\text{AP}}}} = {\text{average}}\left( {\Phi \left( {1,1} \right)} \right) = \left( {8 + 0.2 + 2.2 + 1.1} \right) \div 4 = 2.88\). MP result is: \(y_{{\Phi \left( {1,1} \right)}}^{{{\text{MP}}}} = \max \left( {\Phi \left( {1,1} \right)} \right) = \max \left( {8,0.2,2.2,1.1} \right) = 8\). For the RWP, we first calculate the rank matrix is \({\text{vec}}\left( R \right) = \left( {\begin{array}{*{20}c} {r_{11} } & {r_{12} } & {r_{21} } & {r_{22} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} 1 & 4 & 2 & 3 \\ \end{array} } \right)\). Thus, \(\begin{array}{*{20}c} {{\text{vec}}\left( E \right) = \left( {\begin{array}{*{20}c} {e_{11} } & {e_{12} } & {e_{21} } & {e_{22} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} \frac{1}{2} & {\frac{1}{{2^{4} }}} & {\frac{1}{{2^{2} }}} & {\frac{1}{{2^{3} }}} \\ \end{array} } \right)} \\ \end{array}\). Finally, the RWP result is calculated as \(y_{{\Phi \left( {1,1} \right)}}^{{{\text{RWP}}}} = \frac{8}{2} + \frac{0.2}{{2^{4} }} + \frac{2.2}{{2^{2} }} + \frac{1.1}{{2^{3} }} = 4.70\).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, YD., Satapathy, S.C., Wu, D. et al. Improving ductal carcinoma in situ classification by convolutional neural network with exponential linear unit and rank-based weighted pooling. Complex Intell. Syst. 7, 1295–1310 (2021). https://doi.org/10.1007/s40747-020-00218-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s40747-020-00218-4