Abstract
The novel coronavirus pneumonia (COVID-19) is a global public health crisis that poses a serious threat to public health. In clinical practice, automatic segmentation of lesions from computed tomography (CT) images using deep learning methods provides a promising tool for identifying and diagnosing COVID-19. To improve the accuracy of image segmentation, attention mechanisms are commonly adopted to highlight important features. However, existing attention methods often yield weak gains or even degrade the accuracy of convolutional neural networks (CNNs) for various reasons (e.g., low contrast of the boundary between the lesion and its surroundings, and image noise). To address this issue, we propose a novel focal attention module (FAM) for lesion segmentation of CT images. FAM contains a channel attention module and a spatial attention module. The spatial attention module first generates rough spatial attention, a shape prior of the lesion region obtained from the CT image using median filtering and distance transformation. The rough spatial attention is then passed through two 7 × 7 convolution layers for correction, yielding refined spatial attention on the lesion region. FAM is individually integrated with six state-of-the-art segmentation networks (e.g., UNet and DeepLabV3+), and the six combinations are validated on a public dataset of COVID-19 CT images. The results show that FAM improves the Dice Similarity Coefficient (DSC) of CNNs by 2% and reduces the number of false negatives (FN) and false positives (FP) by up to 17.6%, improvements that are significantly larger than those obtained with other attention modules such as CBAM and SENet. Furthermore, FAM significantly improves the convergence speed of model training and achieves better real-time performance. The code is available on GitHub (https://github.com/RobotvisionLab/FAM.git).
1 Introduction
According to the Center for Systems Science and Engineering (CSSE) of Johns Hopkins University, as of May 25, 2022, COVID-19 had resulted in 526,824,747 infections and 6,280,794 deaths. Rapid detection of infection is essential for prompt isolation and treatment of patients. At present, reverse transcription-polymerase chain reaction (RT-PCR) is the most widely adopted method for COVID-19 diagnosis. However, RT-PCR suffers from drawbacks such as being time-consuming and producing false negatives caused by poor sampling quality [1]. Chest computed tomography (CT) images of COVID-19 patients frequently show patchy bilateral shadows or ground-glass opacity in the lung [2]; hence, chest CT is adopted as a dominant method for the diagnosis of COVID-19 [1, 3]. Compared with RT-PCR, chest CT images are easy to obtain in clinical practice and can therefore be used for severity classification of COVID-19 patients, for which contouring the lesion is an essential procedure. Traditional manual contouring is tedious, time-consuming, and heavily dependent on the doctor’s clinical experience; therefore, there is an urgent need for an automated lesion segmentation method specifically designed for COVID-19 CT images.
Nowadays, the convolutional neural network (CNN), a typical deep learning method, is becoming essential for the segmentation of COVID-19 CT images. Widely used CNNs include FCN [4], SegNet [5], the UNet series (UNet [6], UNet++ [7], UNet3+ [8], etc.), and the DeepLab series [9,10,11]. These methods usually rely on a large-scale dataset with high-quality pixel-level annotation of COVID-19 lesions. The need for large-scale data collection and labeling before model training prevents them from being widely adopted in the context of public health.
The attention mechanism is widely used in fields such as natural language processing (NLP), statistical learning, image processing, and voice recognition. It stems from the selective attention of human vision. An attention mechanism assigns high weights to important information and low weights to unrelated information, and it continuously adjusts these weights so that important information can be selected in different situations, which makes it extensible and robust. Depending on the application context, attention can be grouped into spatial, channel, layer, mixed, and temporal domains. The spatial and channel domains are the most widely used in image processing tasks. Many excellent attention modules, such as SENet [12], CBAM [13], and SKNet [14], have been proposed. To improve the accuracy of deep learning models for COVID-19 lesion segmentation, attention modules (or their variations) have been integrated into state-of-the-art segmentation networks. However, existing attention modules often cannot fully utilize the characteristics of CT images. Moreover, they often disrupt the original feature distribution of the input data, resulting in low segmentation accuracy and slow convergence of network training.
To address the abovementioned issues, we propose a novel design in which the lesion in the CT image is treated as rough spatial attention and then combined with a channel attention module, yielding a novel plug-and-play attention module, named the focal attention module (FAM), for lesion segmentation of COVID-19 CT images. The main contributions of this study are as follows:
1. A novel spatial attention module is proposed. It introduces the shape prior information of the lesion region to improve the feature analysis and weight redistribution of the attention module and accelerates the convergence of the network training.
2. We sequentially combine the spatial attention module in the form of the residual block with the channel attention module, constructing a novel Focal Attention Module (FAM) for lesion segmentation of COVID-19 CT images.
3. FAM is integrated into six state-of-the-art networks and is validated on the public COVID-19 CT image dataset.
The rest of the paper is organized as follows: Sect. 2 describes the work related to the proposed method. Section 3 details the design and implementation of the proposed method. The experiments and discussion are described in Sect. 4. Finally, we conclude the study in Sect. 5.
2 Related work
In this section, we first discuss existing deep learning-based methods for lesion segmentation of COVID-19 CT images, followed by related work on the attention mechanism. Finally, we introduce the applications of shape priors in image segmentation.
2.1 Lesion segmentation of COVID-19 CT images
Because data annotation is labor-intensive and time-consuming, large-scale segmentation datasets of COVID-19 lesions are rarely available. Meanwhile, training networks on a small-scale dataset suffers from issues such as over-fitting and poor generalization performance. Existing deep learning methods have therefore been proposed to reduce the models’ reliance on large-scale datasets. The attention mechanism is used to enhance the feature extraction capability of the network. For example, Fan et al. [15] combined a semi-supervised learning model and the FCN8s network with implicit reverse attention and explicit edge attention mechanisms to build a novel model, which achieves a sensitivity of 72.5% and an accuracy of 96.0%. Chen et al. [16] proposed a residual attention UNet and applied a soft attention mechanism to enhance the feature learning capability of the model. The proposed model achieves a segmentation accuracy of 89%. Zhao et al. [17] integrated their proposed spatial-wise and channel-wise attention modules into UNet++ [7]. The Dice Similarity Coefficient (DSC) of the model is 88.99%. A number of novel loss functions and special network modules have also been proposed. For example, Wang et al. [18] proposed a noise-robust Dice loss to address the poor training results caused by low-quality labels, and the DSC of the model is 80.72%. Inspired by contrast enhancement methods and Atrous Spatial Pyramid Pooling (ASPP) [10], Yan et al. [19] proposed a novel Progressive Atrous Spatial Pyramid Pooling (PASPP) module to progressively aggregate information and obtain more useful contextual features, and the DSC of the model is 72.60%. Elharrouss et al. [20] proposed a multi-class segmentation network based on an encoder-decoder structure, in which the multi-input stream allows the model to learn more features. It achieves a sensitivity of 71.1%.
In addition, multi-scale feature fusion [21], multipoint supervised training [22], and conditional generation models [23] are promising for improving the segmentation accuracy of COVID-19 lesions.
2.2 Attention mechanism
Attention is an essential and complex cognitive function of the human brain [24]. With attention, people can work methodically while receiving a large amount of information through vision, hearing, touch, etc. The human brain can select the small portions of this input that are of interest and focus on them, while ignoring the rest.
In the context of computer vision, attention can be divided into soft attention and hard attention [25]. Soft attention computes attention weights over all the data and sets no filter condition on the data features. Hard attention applies a filter condition after computing the attention weights and sets the weights that do not meet the condition to 0. In contrast, soft attention is probabilistic and end-to-end differentiable, so its attention weights can be learned through forward-propagation and back-propagation without posterior sampling. There are a number of studies on soft attention. Inspired by the invariance of the pooling mechanism to translation and rotation, Jaderberg et al. [26] proposed a spatial transformation module whose transformation can be learned by the network. It was widely used for Optical Character Recognition (OCR). Hu et al. [12] proposed a channel attention model (SENet), but SENet cannot capture spatial contextual information. Woo et al. [13] extended SENet and proposed an attention module (CBAM) to constrain and enhance the input feature map along the channel and spatial dimensions. However, the spatial attention module of CBAM fails to capture information at different scales and cannot establish long-range dependencies. Inspired by the classical non-local means method [27] for image processing, Wang et al. [28] proposed an attention module (Non-local Neural Networks) for capturing long-range dependencies. Fu et al. [29] combined the advantages of CBAM and Non-local Neural Networks to propose DANet, an attention module widely used in semantic segmentation. Drawing on the idea of residual networks, Wang et al. [30] proposed a novel solution to the problem of information reduction caused by stacked attention modules. Huang et al. [31] proposed Criss-Cross Attention to reduce the computation of Non-local Neural Networks.
Gao et al. [32] proposed a Spatially Modulated Co-Attention (SMCA) mechanism to accelerate training convergence, but it suffers from increased computation and inference time. A dedicated channel attention module [33] was proposed to distinguish esophageal cancer from the esophagus and surrounding tissues. However, the literature on hard attention is limited, and studies [34,35,36] argued that reinforcement learning is required to train hard attention because it is non-differentiable.
Although a number of studies [16, 17, 37, 38] have introduced the attention mechanism for lesion segmentation on COVID-19 CT images, further improvement in the performance and accuracy of these models is still needed in both academia and industry.
2.3 Shape priors in image segmentation
Traditional segmentation methods (e.g., thresholding, watershed, and region growing) usually suffer from the lack of robustness and poor segmentation accuracy due to the noise, low contrast, and complexity of objects in medical images. Recently, the rapid development of deep learning methods promoted the adoption of deep learning-based image segmentation algorithms in medical image segmentation. Studies [39,40,41,42,43] have shown that integrating prior knowledge of objects into rigorous segmentation formulas can improve the segmentation accuracy of a specific target. The prior knowledge has been utilized in various forms, e.g., user interaction, object shape and appearance [44].
Shape is one of the most important geometric attributes of anatomical objects, and shape priors can reduce the search space of the potential segmentation outputs for deep learning models [45]. Ravishankar et al. [46] incorporated the shape model explicitly in FCN through a novel loss function that penalizes the deviation of the predicted segmentation mask from a learned shape model. Avanti et al. [47] used stacked autoencoders to infer the target shape and then incorporated the inferred shape into deformable models to improve accuracy and robustness. In addition, Ngo et al. [48] and Cremers et al. [40] combined level set and genetic algorithms with deep learning to improve the training effect of the model on small datasets. Zhao et al. [49] obtained the shape prior of the lung region through threshold segmentation to optimize the segmentation of the lung.
3 Proposed method
3.1 Design rationale of focal attention module
Given an intermediate feature map \(\mathbf{F}\in {\mathbb{R}}^{C\times H\times W}\) and an input CT image \(\mathbf{I}\in {\mathbb{R}}^{1\times H\times W}\) as inputs, FAM sequentially infers a 1D channel attention map \({{\varvec{M}}}_{{\varvec{c}}}\in {\mathbb{R}}^{C\times 1\times 1}\) and a 2D spatial attention map \({\mathbf{M}}_{s}\in {\mathbb{R}}^{1\times H\times W}\), as illustrated in Fig. 1. The attention module is formulated as:
where \(\otimes\) denotes element-wise multiplication. During multiplication, the attention values are broadcast accordingly: channel attention values are broadcast along the spatial dimensions, and spatial attention values are broadcast along the channel dimension. \({\mathbf{F}}^{^{\prime\prime} }\) is the final refined output. Unlike naively stacked attention modules (e.g., CBAM, SENet), the feature map refined by the spatial attention module (as depicted in Fig. 2) is combined as a residual branch into the feature map refined by the channel attention module, based on the following analysis:
1. When the input \(\mathbf{I}\) is a negative sample, the spatial attention obtained from its lung image \(\mathbf{L}\) by distance transformation contains little feature information. In this case, stacking will degrade the value of features in deep layers.
2. The residual branch works as a feature selector, which enhances good features and suppresses noise from trunk features.
3. Inspired by Residual Attention Network [38], attention residual learning not only keeps the good properties of the original features but also allows them to be refined by the spatial attention module.
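The display equation for the module is not reproduced in this text. One reading consistent with the description above is \({\mathbf{F}}^{\prime} = {\mathbf{M}}_{c}(\mathbf{F}) \otimes \mathbf{F}\) followed by \({\mathbf{F}}^{\prime\prime} = {\mathbf{F}}^{\prime} + {\mathbf{M}}_{s}(\mathbf{I}) \otimes {\mathbf{F}}^{\prime}\); this residual form is an assumption, not a confirmed transcription. The NumPy sketch below uses hypothetical function names to contrast that form with naive stacking for a negative sample whose spatial attention is all zeros.

```python
import numpy as np

def fam_residual(F, M_c, M_s):
    """Assumed FAM combination: channel attention, then a residual spatial branch.

    F   : feature map, shape (C, H, W)
    M_c : channel attention, shape (C, 1, 1)
    M_s : spatial attention, shape (1, H, W)
    """
    F_prime = M_c * F                # M_c is broadcast over H and W
    return F_prime + M_s * F_prime   # M_s is broadcast over C (assumed residual form)

def naive_stacking(F, M_c, M_s):
    """CBAM-style sequential stacking, shown only for comparison."""
    return M_s * (M_c * F)

C, H, W = 4, 3, 3
F = np.ones((C, H, W))
M_c = np.full((C, 1, 1), 0.5)
M_s_negative = np.zeros((1, H, W))   # negative sample: the shape prior is empty

out_residual = fam_residual(F, M_c, M_s_negative)
out_stacked = naive_stacking(F, M_c, M_s_negative)
```

With an empty prior, the stacked variant zeroes the whole feature map, whereas the residual variant passes the channel-refined features through unchanged, which is the behavior the first point of the analysis argues for.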
3.2 Channel attention module
The channel attention module focuses on “what” is meaningful given the feature maps. To compute the channel attention efficiently, the spatial information of a feature map is first aggregated by average-pooling and max-pooling operations, yielding two different spatial context descriptors (i.e., \({\mathbf{F}}_{\text{avg }}^{\mathrm{c}}\) and \({\mathbf{F}}_{max}^{\mathrm{c}}\)). Both descriptors are then forwarded to a multi-layer perceptron (MLP) with one hidden layer, producing two output feature maps. Finally, the output feature maps are merged using element-wise summation. To reduce the number of parameters, the hidden layer size is set to \({\mathbb{R}}^{C/r\times 1\times 1}\), where \(r\) is the reduction ratio. The channel attention is formulated as:
where \(\sigma\) represents the sigmoid function, \({\mathbf{W}}_{0}\in {\mathbb{R}}^{C/r\times C}\), and \({\mathbf{W}}_{1}\in {\mathbb{R}}^{C\times C/r}\). \({\mathbf{W}}_{0}\) and \({\mathbf{W}}_{1}\) are the weights of the MLP and are shared for both inputs. The ReLU activation function is applied after \({\mathbf{W}}_{0}\).
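The display equation is missing here, but the description matches the CBAM-style form \({\mathbf{M}}_{c}(\mathbf{F}) = \sigma ({\mathbf{W}}_{1}({\mathbf{W}}_{0}({\mathbf{F}}_{avg}^{c})) + {\mathbf{W}}_{1}({\mathbf{W}}_{0}({\mathbf{F}}_{max}^{c})))\), with ReLU applied after \({\mathbf{W}}_{0}\). The NumPy sketch below follows that reading; the function names, the random weights, and the choice \(C = 16\), \(r = 4\) are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W0, W1):
    """Channel attention for one feature map F of shape (C, H, W).

    W0 has shape (C // r, C) and W1 has shape (C, C // r); both are shared
    between the average-pooled and the max-pooled descriptors.
    """
    C = F.shape[0]
    f_avg = F.mean(axis=(1, 2))          # global average pooling, shape (C,)
    f_max = F.max(axis=(1, 2))           # global max pooling, shape (C,)

    def mlp(v):
        hidden = np.maximum(W0 @ v, 0.0)  # ReLU after W0
        return W1 @ hidden

    return sigmoid(mlp(f_avg) + mlp(f_max)).reshape(C, 1, 1)

rng = np.random.default_rng(0)
C, r = 16, 4
F = rng.standard_normal((C, 8, 8))
W0 = 0.1 * rng.standard_normal((C // r, C))
W1 = 0.1 * rng.standard_normal((C, C // r))
M_c = channel_attention(F, W0, W1)       # shape (16, 1, 1), values in (0, 1)
```

Multiplying `M_c * F` then rescales each channel by its attention weight through broadcasting over the spatial dimensions.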
3.3 Spatial attention module
The spatial attention module focuses on “where” the informative regions are in the given feature maps. The spatial attention in CBAM is learned from a 7 × 7 convolution layer, and it has two shortcomings when dealing with the specific task of lesion segmentation for COVID-19 diagnosis: (1) low learning efficiency; (2) the change that spatial attention makes to the feature space easily causes problematic training convergence and poor network generalization, especially when the dataset is small or the backbone network has few parameters. Two mechanisms are introduced to address these issues: (1) adopting a residual structure that refines the feature maps with spatial attention while preserving the trunk network features (as shown in Fig. 1); (2) utilizing the shape prior of the COVID-19 lesion region to reduce the search space of the spatial attention module. The main steps in computing the spatial attention (as shown in Fig. 2b) are lung segmentation, median filtering, and distance transformation.
3.3.1 Lung segmentation
To efficiently obtain the shape prior of the lesion region, the lung needs to be segmented from CT images. Currently, many excellent methods of lung segmentation have been proposed and widely used. These methods are mainly divided into three types: traditional image processing-based algorithms, deep learning-based algorithms and the combination of the two former methods. Because segmentation of the lung is not the focus of this paper, the lung region is segmented with a simple mask operation from labels in the dataset.
3.3.2 Median filtering
Median filtering is introduced to eliminate some of the noise pixels, which consist of the pulmonary trachea and pulmonary vessels, from the lung image. Median filtering, a nonlinear method that preserves the edge details of an image while eliminating noise, has been widely used in fields such as image enhancement and image recovery. As shown in Fig. 3a, a few noise pixels in the lung region, such as regions of the tiny pulmonary trachea and pulmonary vessels, interfere significantly with the accurate segmentation of lesions. As shown in Fig. 3b, median filtering eliminates most of the small pulmonary trachea and pulmonary vessels. For the large pulmonary trachea and pulmonary vessels, median filtering also reduces their pixel region. Meanwhile, median filtering retains the nature of the lesions (i.e., ground-glass opacity) with little reduction in their area, because the lesions occupy a large pixel region.
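The effect described above can be illustrated on a toy binary image. The sketch below is a plain NumPy median filter; the 3 × 3 window and the toy layout are assumptions, since the kernel size used in the study is not stated in this text.

```python
import numpy as np

def median_filter(img, k=3):
    """Apply a k x k median filter with edge replication at the borders."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out

lung = np.zeros((9, 9), dtype=np.uint8)
lung[1, 1] = 1        # isolated pixel standing in for a tiny vessel
lung[3:8, 3:8] = 1    # 5 x 5 block standing in for a lesion (25 pixels)

filtered = median_filter(lung, k=3)
```

In this toy case the isolated pixel disappears, while the block loses only its four corner pixels, mirroring the observation that small structures are removed and large regions shrink slightly.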
3.3.3 Distance transformation
Distance transformation (DT) converts a digital binary image that consists of object and non-object pixels into another image in which each object pixel holds a value corresponding to its minimum distance from the non-object pixels under a distance function [50, 51]. Distance transformation is widely used for target thinning and object skeleton extraction. Euclidean distance, city block distance, and chessboard distance are widely used measures for distance transformation. The full workflow of distance transformation is as follows:
Given an image \(J\), it is binarized to obtain an image \({J}_{b}\). In \({J}_{b}\), 1 is associated with object pixels and 0 with background pixels. Hence, we have a pixel set \(\mathcal{O}\) consisting of all the object pixels and a set \({\mathcal{O}}^{c}\) consisting of all the background pixels.
where \(t\) and \(b\) represent the pixel of objects and background respectively. The distance transformation (\(DT\)) generates a map \(D\), in which the value of each pixel in \(\mathcal{O}\) is the smallest distance from this pixel to \({\mathcal{O}}^{c}\):
where the image \(D\) is called the distance map of \(J\). It is assumed that \(\mathcal{O}\) contains at least one pixel. Otherwise, the output of the \(DT\) is undefined, i.e., the outliers will be ignored in the distance transformation. Moreover, \(d(t,b)\) represents the Euclidean distance, which is formulated as:
where \(H\) and \(W\) represent the height and width of the image \(J\) respectively.
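The three display equations of this subsection are missing from this text. Under the standard definitions, which are assumed here, they would read \(\mathcal{O}=\{t \mid {J}_{b}(t)=1\}\) and \({\mathcal{O}}^{c}=\{b \mid {J}_{b}(b)=0\}\), \(D(t)=\underset{b\in {\mathcal{O}}^{c}}{\mathrm{min}}\,d(t,b)\) for \(t\in \mathcal{O}\), and \(d(t,b)=\sqrt{{({t}_{x}-{b}_{x})}^{2}+{({t}_{y}-{b}_{y})}^{2}}\). The brute-force NumPy sketch below implements exactly these definitions on a small image; the function name is hypothetical.

```python
import numpy as np

def distance_transform(J_b):
    """Euclidean distance map of a binary image (1 = object, 0 = background).

    Brute force: every object pixel is compared with every background pixel,
    so this is only suitable for small images.
    """
    obj = np.argwhere(J_b == 1)                  # coordinates of the set O
    bg = np.argwhere(J_b == 0)                   # coordinates of the set O^c
    D = np.zeros(J_b.shape, dtype=float)
    if len(obj) == 0 or len(bg) == 0:
        return D                                 # degenerate input: return zeros
    diff = obj[:, None, :] - bg[None, :, :]      # pairwise coordinate differences
    dist = np.sqrt((diff ** 2).sum(axis=2))      # pairwise Euclidean distances
    D[obj[:, 0], obj[:, 1]] = dist.min(axis=1)   # nearest background pixel
    return D

J_b = np.zeros((7, 7), dtype=np.uint8)
J_b[1:6, 1:6] = 1                                # 5 x 5 object
D = distance_transform(J_b)
```

Pixels deep inside the object receive the largest values, which is why the distance map emphasizes the core of large connected regions. For full-size CT slices, an optimized routine such as `scipy.ndimage.distance_transform_edt` would be the practical choice.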
As shown in Fig. 3c, distance transformation is used to eliminate noise pixels (i.e., the pulmonary trachea and pulmonary vessels) and extract the main lesion region. By applying sequential median filtering and distance transformation, the distribution of connected regions in a lung image is shown in Fig. 4. The connected regions containing more than 200 pixels represent the lesion region, and those with a small area represent the region of the pulmonary trachea and pulmonary vessels. By applying distance transformation, the distribution of connected regions in the lung image is close to the ground truth. Distance maps of several lung images are shown in Fig. 5. By comparing the distance map with the corresponding lesion label, the main lesion region is extracted.
Normalization serves as an activation function to obtain the shape prior of the lesion region. As shown in Fig. 6b, distance transformation strengthens the weight of the main connected regions and weakens the weight of the edge parts (as shown in Fig. 6a). Furthermore, as shown in Fig. 6c, normalization removes both the edges and the connected regions with a small area. The normalization function \(Norm\) is formulated as:
where \(X\) represents an image matrix, and \({X}_{\mathrm{min}}\) and \({X}_{\mathrm{max}}\) represent the minimum and maximum values in \(X\), respectively. Next, as illustrated in Fig. 2b, two 7 × 7 convolution layers are utilized to learn the attention weights of edge pixels, which have low boundary contrast with their surroundings, and to adaptively tune the attention weights of other regions. Finally, refined spatial attention is obtained. Spatial attention is formulated as:
where \(\mathbf{I}\) is the input CT image, \(\sigma\) represents the sigmoid function, \({f}^{\mathrm{seg}}\) and \({f}^{\text{med}}\) represent the lung segmentation network and median filtering, respectively. \({f}_{7*7}^{\mathrm{cov}}\) is the 7 × 7 convolution layer. \(\mathrm{DT}\) represents the distance transformation.
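The two display equations are missing from this text. Given the symbol definitions, a plausible reading is min-max normalization, \(Norm(X)=(X-{X}_{\mathrm{min}})/({X}_{\mathrm{max}}-{X}_{\mathrm{min}})\), and a composite of the form \({\mathbf{M}}_{s}(\mathbf{I})=\sigma ({f}_{7*7}^{\mathrm{cov}}({f}_{7*7}^{\mathrm{cov}}(Norm(\mathrm{DT}({f}^{\text{med}}({f}^{\mathrm{seg}}(\mathbf{I})))))))\); both are assumptions. The sketch below covers only the normalization step, since the convolution layers are learned. The guard for a constant input (for example, an all-zero distance map from a negative sample) is an added assumption.

```python
import numpy as np

def min_max_norm(X, eps=1e-8):
    """Rescale X to [0, 1]; return zeros if X is (nearly) constant."""
    X = np.asarray(X, dtype=float)
    span = X.max() - X.min()
    if span < eps:
        return np.zeros_like(X)      # assumed convention for a constant map
    return (X - X.min()) / span

distance_map = np.array([[0.0, 1.0],
                         [2.0, 4.0]])
rough_attention = min_max_norm(distance_map)
empty_attention = min_max_norm(np.zeros((2, 2)))
```

The normalized map lies in [0, 1] and can be treated as the rough spatial attention that the two 7 × 7 convolution layers subsequently refine.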
4 Experiment
4.1 Dataset and metrics
The dataset contains 20 groups of labeled CT scans of COVID-19 patients. Two radiologists manually labeled the regions of the left lung, right lung, and COVID-19 lesions on the images, and the labeled regions were additionally verified by another radiologist. CT slices are exported from the CT scans as 2D images (752 × 670 pixels), and a group of processed images is shown in Fig. 7. The dataset contains a total of 3520 CT images and is divided into a training set and a testing set at a ratio of 4:1 (as detailed in Table 1). Note that “positive” denotes CT images with lesions, and “negative” denotes healthy CT images.
The Dice Similarity Coefficient (DSC), false negatives (FN), false positives (FP), and inference time of the networks are adopted as evaluation metrics. DSC is a standard metric for comparing the pixel-wise results between the ground truth and the predicted segmentation. It is formulated as follows:
where \(A\) is the lesion label, and \(B\) denotes the segmented lesion image.
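The display equation is missing from this text; the standard definition, \(DSC=2|A\cap B|/(|A|+|B|)\), is assumed here. The sketch below computes it for binary masks. Returning 1.0 when both masks are empty is an assumed convention, relevant for negative samples, and is not taken from the study.

```python
import numpy as np

def dice(A, B):
    """Dice Similarity Coefficient between two binary masks."""
    A = np.asarray(A, dtype=bool)
    B = np.asarray(B, dtype=bool)
    total = A.sum() + B.sum()
    if total == 0:
        return 1.0                       # assumed convention: both masks empty
    intersection = np.logical_and(A, B).sum()
    return 2.0 * intersection / total

label = np.array([[1, 1],
                  [0, 0]])
prediction = np.array([[1, 0],
                       [0, 0]])
score = dice(label, prediction)          # 2 * 1 / (2 + 1)
```

Here one of the two labeled pixels is recovered with no false positives, so the score is 2/3.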
4.2 Training method
An Adam optimizer with a learning rate of 0.001 is employed to minimize the binary cross-entropy (BCE) loss. The total number of training epochs and the batch size are set to 100 and 1, respectively. The network weights are initialized with Kaiming initialization, and the network biases are initialized to 0. Moreover, the positive and negative samples are trained alternately, and the dataset is shuffled in each iteration. The BCE loss is formulated as follows:
where \(G\) is the label and \(P\) is the output of the network.
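The display equation is missing from this text. The usual pixel-averaged form, \(L=-\frac{1}{N}\sum_{i=1}^{N}[{G}_{i}\mathrm{log}{P}_{i}+(1-{G}_{i})\mathrm{log}(1-{P}_{i})]\), is assumed in the sketch below; whether the study averages or sums over pixels is not stated here. The clipping constant is an added numerical safeguard.

```python
import numpy as np

def bce_loss(G, P, eps=1e-7):
    """Mean binary cross-entropy between labels G and predictions P in [0, 1]."""
    G = np.asarray(G, dtype=float)
    P = np.clip(np.asarray(P, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(G * np.log(P) + (1.0 - G) * np.log(1.0 - P)))

G = np.array([1.0, 0.0, 1.0, 0.0])
uncertain = bce_loss(G, np.full(4, 0.5))   # equals log(2)
perfect = bce_loss(G, G)                   # close to 0
```

A maximally uncertain prediction of 0.5 everywhere costs log 2 per pixel, and the loss approaches zero as the prediction approaches the label.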
4.3 Ablation analysis
FAM can be conveniently integrated into any CNN as a lightweight and plug-and-play attention module. As shown in Fig. 8, FAM is integrated into convolution layers to refine the intermediate feature map. FAM and two state-of-the-art attention modules (i.e., SENet [12] and CBAM [13]) are each inserted between every two convolution layers of six state-of-the-art segmentation networks [6,7,8,9,10,11, 16, 21] for ablation experiments. All the related networks and modules are reproduced in PyTorch, and training and inference are run on a single NVIDIA GeForce RTX 2080Ti GPU with CUDA v10.2 and cuDNN v7.6.5. The main concerns in this ablation analysis are the shape prior, the time complexity, the performance of the network, and the convergence rate of the network training.
4.3.1 Shape prior
FAM is constructed based on the shape prior and the spatial attention module of CBAM. To illustrate the effect of the shape prior, CBAM and FAM are each integrated between the convolution layers of SegNet (as shown in Fig. 9), and the resulting spatial attention is visualized in Fig. 10. Since SegNet is a typical encoder-decoder structure, in which low-level features and high-level semantic features are extracted in turn, the feature maps in the middle layers are abstract. CBAM inherits this behavior because of the way its spatial attention module is connected to the network, i.e., the input of the spatial attention module comes from inside the network rather than from outside. Thus, the spatial attention of CBAM integrated into the middle layers is abstract, and that integrated into other layers highlights only the lung region rather than the lesion region, resulting in inaccurate spatial attention learning in CBAM. In FAM, the shape prior information of the lesion region is introduced into the spatial attention module to reduce the search space. In summary, FAM focuses on the lesion region without being disturbed by the non-lesion region during the spatial attention learning process.
4.3.2 Time complexity
Time complexity determines the training and inference time of the network. A network with high time complexity suffers from poor real-time performance. The number of floating point operations (FLOPs) is a classical metric of time complexity. The time complexity of a network is formulated as follows:
where \(l\) denotes the number of layers in the network, and \({f}_{FLOPs}\) represents the function that calculates the FLOPs of a layer. The FLOPs for each type of layer are formulated as:
where \({f}_{\mathrm{FLOPs}}^{\mathrm{conv}}\), \({f}_{\mathrm{FLOPs}}^{\mathrm{linear}}\), \({f}_{\mathrm{FLOPs}}^{\mathrm{pooling}}\), \({f}_{\mathrm{FLOPs}}^{\mathrm{relu}}\) and \({f}_{\mathrm{FLOPs}}^{\mathrm{sigmoid}}\) represent the functions that calculate the FLOPs of the convolution layer, fully connected layer, global pooling layer, ReLU layer, and sigmoid layer, respectively. These five types of layers are used in FAM, SENet and CBAM. \({C}_{in}\) and \({C}_{out}\) denote the channel numbers of the input and output feature maps, respectively. \({K}_{w}\) and \({K}_{h}\) denote the width and height of the convolution kernel, respectively. \({Q}_{w}\) and \({Q}_{h}\) denote the width and height of the input feature map, respectively. \({N}_{\mathrm{in}}\) and \({N}_{\mathrm{out}}\) denote the numbers of input and output neurons. We set the channel number of the input feature map to 16 and the reduction ratio of the channel attention module to 16. The size of the input feature map is 652 × 752. The FLOPs of FAM, SENet and CBAM are listed in Table 2. SENet achieves the smallest FLOPs because it lacks a spatial attention module. FAM and CBAM share a similar structure, but their FLOPs differ by about 15 million (see Table 2), which is attributed to FAM having fewer pooling layers than CBAM. As shown in Table 3, the FLOPs of the six state-of-the-art segmentation networks increase very little when FAM is integrated between every two convolution layers. Moreover, inference with FAM is on average 4 ms faster than with CBAM. In conclusion, FAM is able to greatly improve the performance of the network with a very small increase in time complexity.
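The per-layer expressions are missing from this text, and FLOP-counting conventions differ between tools. The sketch below therefore uses one common convention as an assumption: one multiply-accumulate is counted as one operation, bias terms are ignored, and pooling and activation layers cost one operation per input element. The network total is the sum over layers, \(\sum_{i=1}^{l}{f}_{FLOPs}({\text{layer}}_{i})\). The results are illustrative and are not meant to reproduce Table 2.

```python
def conv_flops(c_in, c_out, k_w, k_h, q_w, q_h):
    """Convolution with 'same' padding and stride 1 (assumed), MACs counted once."""
    return c_in * k_w * k_h * c_out * q_w * q_h

def linear_flops(n_in, n_out):
    """Fully connected layer without bias."""
    return n_in * n_out

def pooling_flops(c_in, q_w, q_h):
    """Global pooling: one operation per input element."""
    return c_in * q_w * q_h

def activation_flops(n_elements):
    """ReLU or sigmoid: one operation per element."""
    return n_elements

def total_flops(per_layer_counts):
    """Time complexity of a network as the sum over its layers."""
    return sum(per_layer_counts)

Q_W, Q_H, C = 752, 652, 16                       # feature-map size and channels from the text
one_conv_7x7 = conv_flops(1, 1, 7, 7, Q_W, Q_H)  # single-channel 7 x 7 convolution
one_pooling = pooling_flops(C, Q_W, Q_H)         # one pooling pass over 16 channels
example_total = total_flops([one_conv_7x7, one_pooling, activation_flops(Q_W * Q_H)])
```

Under this convention, a single-channel 7 × 7 convolution on a 652 × 752 map costs about 24 million operations and one pooling pass over 16 channels about 7.8 million, so spatial layers dominate the cost of an attention module at this resolution.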
4.3.3 Performance analysis of the networks with attention modules
The average DSC, FN and FP of all images in the test set are adopted to analyze the performance of the integration of attention modules to the networks (as detailed in Table 3). FAM adds almost no extra parameters to the network, as it only has one more 7 × 7 convolution than CBAM. The integration of FAM improves the DSC of all these networks, with an improvement of 2% for SegNet. In addition, the integration of FAM reduces the FN and FP, with a reduction of 17.6% for PSPNet.
SENet improves the performance of UNet, PSPNet and UNet++, but it degrades the performance of FCN, SegNet and DeepLabV3+. CBAM improves the performance of SegNet and PSPNet, but it degrades the performance of FCN, UNet, DeepLabV3+ and UNet++. This finding shows that the effectiveness of integrating an attention module depends on the structure of the network.
Figure 11 shows the segmentation results of SegNet integrated with different attention modules on the dataset. As shown in Fig. 11, SegNet without an attention module suffers from false detections in lesions of both the left and right lobes to some extent. The integration of SENet alleviates this problem, whereas CBAM exacerbates it. Although SegNet integrated with FAM still produces many false detections, it has the highest overall segmentation accuracy.
4.3.4 Convergence analysis of the network training
Six networks with attention modules are trained with the same experimental settings and dataset. As shown in Fig. 12, all networks converge within 60 iterations. FAM accelerates model training more than SENet or CBAM does. For SegNet, UNet++ and DeepLabV3+, SENet does not significantly accelerate the model training. For UNet and UNet++, CBAM slows the convergence of the model training. In contrast, FAM accelerates the training of FCN, UNet, SegNet, PSPNet and DeepLabV3+ and also reaches the lowest loss.
For UNet++, FAM performs better than the other two attention modules. For UNet and PSPNet, FAM makes the training process more stable. The results show that, across these six networks, FAM achieves a faster convergence rate and a lower converged loss value than CBAM does. In addition, although SENet applies attention only in the channel dimension, it achieves better performance than CBAM in specific networks such as FCN, UNet and PSPNet.
5 Conclusion
In this study, a lightweight and plug-and-play attention module, FAM, is proposed to improve the lesion segmentation performance of CNNs for COVID-19 CT images. FAM refines the input feature map along the channel and spatial dimensions to maximize the network representation. In the spatial attention of FAM, the shape prior of the lesion region is used to reduce the search space for attention learning. In addition, the feature map refined by spatial attention is added to the network as a residual branch. A set of experiments showed that: (1) FAM could improve the segmentation performance on a small-scale public COVID-19 CT image dataset; (2) FAM could accelerate the convergence of the model training; (3) FAM is capable of being stacked in a deep segmentation network without performance loss; (4) FAM could achieve better real-time performance.
FAM is promising for practical use in public health. In the future, we will work towards improving the generated shape prior to enhance the generalization performance of FAM based on up-to-date COVID-19 CT image datasets.
Data availability
The datasets generated during and analyzed during the current study are available from the corresponding author on reasonable request.
References
Ai, T., Yang, Z., Hou, H., et al.: Correlation of chest CT and RT-PCR testing for coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases. Radiology 296(2), E32–E40 (2020). https://doi.org/10.1148/radiol.2020200642
Adams, H.J., Kwee, T.C., Yakar, D., et al.: Chest CT imaging signature of coronavirus disease 2019 infection: in pursuit of the scientific evidence. Chest 158(5), 1885–1895 (2020). https://doi.org/10.1016/j.chest.2020.06.025
Xu, X., Tian, H., Zhang, X., Qi, L., He, Q., Dou, W.: DisCOV: distributed COVID-19 detection on X-ray images with edge-cloud collaboration. IEEE Trans. Serv. Comput. (2022). https://doi.org/10.1109/TSC.2022.3142265
Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 640–651 (2016). https://doi.org/10.1109/TPAMI.2016.2572683
Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017). https://doi.org/10.1109/TPAMI.2016.2644615
Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp 234–241 (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., et al.: Unet++: a nested u-net architecture for medical image segmentation. In: Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 3–11. Springer (2018). https://doi.org/10.1007/978-3-030-00889-5_1
Huang, H., Lin, L., Tong, R., Hu, H., Zhang, Q., Iwamoto, Y., Han, X., Chen, Y.-W., Wu, J.: Unet 3+: A full-scale connected UNET for medical image segmentation. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 1055–1059. (2020) https://doi.org/10.1109/ICASSP40776.2020.9053405
Chen, L.-C., Papandreou, G., Kokkinos, I. et al.: Semantic image segmentation with deep convolutional nets and fully connected CRFS. (2014) [Online]. https://arxiv.org/abs/1412.7062
Chen, L.-C., Papandreou, G., Kokkinos, I., et al.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFS. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017). https://doi.org/10.1109/TPAMI.2017.2699184
Chen, L.-C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. In: Conference on Computer Vision and Pattern Recognition (CVPR). IEEE/CVF, (2017) [Online]. https://arxiv.org/abs/1706.05587
Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E.: Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 42(8), 2011–2023 (2020). https://doi.org/10.1109/TPAMI.2019.2913372
Woo, S., Park, J., Lee, J.-Y., Kweon, I. S.: Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp. 3–19 (2018) https://doi.org/10.1007/978-3-030-01234-2_1
Wu, W., Zhang, Y., Wang, D., Lei, Y.: Sk-net: deep learning on point cloud via end-to-end discovery of spatial keypoints. Proc. AAAI Conf. Artif. Intell. 34(04), 6422–6429 (2020). https://doi.org/10.1609/aaai.v34i04.6113
Fan, D.P., Zhou, T., Ji, G.P., et al.: Inf-Net: automatic COVID-19 lung infection segmentation from CT images. IEEE Trans. Med. Imaging PP(99), 1–1 (2020). https://doi.org/10.1109/TMI.2020.2996645
Chen, X., Yao, L., Zhang, Y.: Residual attention u-net for automated multi-class segmentation of covid-19 chest CT images. (2020) [Online]. Available: https://arxiv.org/abs/2004.05645
Zhao, S., Li, Z., Chen, Y., et al.: SCOAT-Net: a novel network for segmenting COVID-19 lung opacification from CT images. Pattern Recogn. 119, 108109 (2021). https://doi.org/10.1016/j.patcog.2021.108109
Wang, G., Liu, X., Li, C., et al.: A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images. IEEE Trans. Med. Imaging 39(8), 2653–2663 (2020). https://doi.org/10.1109/TMI.2020.3000314
Yan, Q., Wang, B., Gong, D. et al.: COVID-19 chest CT image segmentation - a deep convolutional neural network solution. (2020) [Online]. Available: https://arxiv.org/abs/2004.10987
Elharrouss, O., Subramanian, N., Al-Maadeed, S.: An encoder-decoder-based method for COVID-19 lung infection segmentation. (2020) [Online]. Available: https://arxiv.org/abs/2007.00861
Qiu, Y., Liu, Y., Li, S., Xu, J.: Miniseg: an extremely minimum network for efficient COVID-19 segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 6, pp. 4846–4854 (2021). https://ojs.aaai.org/index.php/AAAI/article/view/16617
Pei, H.-Y., Yang, D., Liu, G.-R., et al.: MPS-net: multi-point supervised network for CT image segmentation of covid-19. IEEE Access 9, 47144–47153 (2021). https://doi.org/10.1109/ACCESS.2021.3067047
Zhang, P., Zhong, Y., Deng, Y., et al.: CoSinGAN: learning COVID-19 infection segmentation from a single radiological image. Diagnostics 10(11), 901 (2020). https://doi.org/10.3390/diagnostics10110901
Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998). https://doi.org/10.1109/34.730558
Wang, F., Tax, D. M.: Survey on the attention based RNN model and its applications in computer vision. (2016) [Online]. Available: https://arxiv.org/abs/1601.06823
Jaderberg, M., Simonyan, K., Zisserman, A.: Spatial transformer networks. Adv. Neural Inform. Process. Syst. 28 (2015). https://dl.acm.org/doi/abs/10.5555/2969442.2969465
Buades, A., Coll, B., Morel, J.-M.: A non-local algorithm for image denoising. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2. IEEE, pp. 60–65 (2005). https://doi.org/10.1109/CVPR.2005.38
Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7794–7803 (2018). https://doi.org/10.1109/CVPR.2018.00813
Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H.: Dual attention network for scene segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3146–3154 (2019). https://doi.org/10.1109/CVPR.2019.00326
Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X.: Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164 (2017). https://doi.org/10.1109/CVPR.2017.683
Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 603–612 (2019). https://doi.org/10.1109/TPAMI.2020.3007032
Gao, P., Zheng, M., Wang, X., Dai, J., Li, H.: Fast convergence of detr with spatially modulated co-attention (2021) [Online]. https://doi.org/10.48550/arXiv.2108.02404
Huang, G., Zhu, J., Li, J., Wang, Z., Cheng, L., Liu, L., Li, H., Zhou, J.: Channel-attention U-Net: channel attention mechanism for semantic segmentation of esophagus and esophageal cancer. IEEE Access 8, 122798–122810 (2020). https://doi.org/10.1109/ACCESS.2020.3007719
Zhao, B., Wu, X., Feng, J., et al.: Diversified visual attention networks for fine-grained object classification. IEEE Trans. Multimed. 19(6), 1245–1256 (2017). https://doi.org/10.1109/TMM.2017.2648498
Mnih, V., Heess, N., Graves, A.: Recurrent models of visual attention. Adv. Neural Inform. Process. Syst. 27 (2014). https://dl.acm.org/doi/abs/10.5555/2969033.2969073
Liu, X., Xia, T., Wang, J. et al.: Fully convolutional attention networks for fine-grained recognition. (2016) [Online]. https://arxiv.org/abs/1603.06765
Zhao, X., Zhang, P., Song, F. et al.: D2a u-net: automatic segmentation of COVID-19 lesions from CT slices with dilated convolution and dual attention mechanism. (2021) [Online]. https://arxiv.org/abs/2102.05210
Zhou, T., Canu, S., Ruan, S.: Automatic COVID-19 CT segmentation using U-Net integrated spatial and channel attention mechanism. Int. J. Imaging Syst. Technol. 31(1), 16–27 (2021). https://doi.org/10.1002/ima.22527
Zhou, X., Xu, X., Liang, W., Zeng, Z., Yan, Z.: Deep-learning-enhanced multitarget detection for end-edge-cloud surveillance in smart IoT. IEEE Internet Things J. 8(16), 12588–12596 (2021). https://doi.org/10.1109/JIOT.2021.3077449
Cremers, D., Osher, S.J., Soatto, S.: Kernel density estimation and intrinsic alignment for shape priors in level set segmentation. Int. J. Comput. Vis. 69(3), 335–351 (2006). https://doi.org/10.1007/s11263-006-7533-5
Li, K., Tao, W.: Adaptive optimal shape prior for easy interactive object segmentation. IEEE Trans. Multimed. 17(7), 994–1005 (2015). https://doi.org/10.1109/TMM.2015.2433795
Wang, H., Zhang, H.: Adaptive shape prior in graph cut segmentation. In: 2010 IEEE International Conference on Image Processing. IEEE, pp 3029–3032 (2010). https://doi.org/10.1109/ICIP.2010.5653335
Veksler, O.: Star shape prior for graph-cut image segmentation. In: European Conference on Computer Vision. Springer, pp 454–467 (2008). https://doi.org/10.1007/978-3-540-88690-7_34
Nosrati, M. S., Hamarneh, G.: Incorporating prior knowledge in medical image segmentation: a survey (2021) [Online]. Available: https://arxiv.org/abs/1607.01092
Lee, M.C.H., Petersen, K., Pawlowski, N., Glocker, B., Schaap, M.: TeTrIS: template transformer networks for image segmentation with shape priors. IEEE Trans. Med. Imaging 38(11), 2596–2606 (2019). https://doi.org/10.1109/TMI.2019.2905990
Ravishankar, H., Venkataramani, R., Thiruvenkadam, S., Sudhakar, P., Vaidya, V.: Learning and incorporating shape models for semantic segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, pp 203–211 (2017). https://doi.org/10.1007/978-3-319-66182-7_24
Avendi, M.R., Kheradvar, A., Jafarkhani, H.: A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Med. Image Anal. 30, 108–119 (2016). https://doi.org/10.1016/j.media.2016.01.005
Ngo, T.A., Lu, Z., Carneiro, G.: Combining deep learning and level set for the automated segmentation of the left ventricle of the heart from cardiac cine magnetic resonance. Med. Image Anal. 35, 159–171 (2017). https://doi.org/10.1016/j.media.2016.05.009
Zhao, C., Xu, Y., He, Z., Tang, J., Zhang, Y., Han, J., Shi, Y., Zhou, W.: Lung segmentation and automatic detection of COVID-19 using radiomic features from chest CT images. Pattern Recogn. 119, 108071 (2021). https://doi.org/10.1016/j.patcog.2021.108071
Rosenfeld, A., Pfaltz, J.L.: Sequential operations in digital picture processing. J. ACM (JACM) 13(4), 471–494 (1966). https://doi.org/10.1145/321356.321357
Shih, F.Y., Wu, Y.-T.: Fast Euclidean distance transformation in two scans using a 3x3 neighborhood. Comput. Vis. Image Underst. 93(2), 195–205 (2004). https://doi.org/10.1016/j.cviu.2003.09.004
Acknowledgements
This study was supported by the Natural Science Foundation of Zhejiang Province (No. LQ21H190004), the China Postdoctoral Science Foundation (No. 2020T130102ZX), the Postdoctoral Science Foundation of Zhejiang Province (No. ZJ2020031), and the Educational Commission of Zhejiang Province of China (No. Y202147553).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Wu, X., Zhang, Z., Guo, L. et al. FAM: focal attention module for lesion segmentation of COVID-19 CT images. J Real-Time Image Proc 19, 1091–1104 (2022). https://doi.org/10.1007/s11554-022-01249-5
DOI: https://doi.org/10.1007/s11554-022-01249-5