1 Introduction

According to the report from the Center for Systems Science and Engineering (CSSE) of Johns Hopkins University, as of May 25, 2022, COVID-19 had resulted in 526,824,747 infections and 6,280,794 deaths. Rapid detection of the infection is essential for prompt isolation and treatment of patients. At present, reverse transcription-polymerase chain reaction (RT-PCR) is the most widely adopted method for COVID-19 diagnosis. However, RT-PCR suffers from drawbacks such as being time-consuming and producing false negatives caused by poor sampling quality [1]. The chest computed tomography (CT) images of COVID-19 patients frequently show patchy bilateral shadows or ground-glass opacity in the lung [2]; hence, chest CT has been adopted as a dominant method for the diagnosis of COVID-19 [1, 3]. Compared with RT-PCR, chest CT images are easy to obtain in clinical practice, so they can be used for severity classification of COVID-19 patients, for which contouring the lesion is an essential procedure. Traditional manual contouring is tedious, time-consuming, and heavily dependent on the doctor's clinical experience; therefore, there is an urgent need for an automated lesion segmentation method specially designed for COVID-19 CT images.

Nowadays, convolutional neural networks (CNNs), a typical deep learning method, have become essential for the segmentation of COVID-19 CT images. Widely used CNNs include FCN [4], SegNet [5], the UNet series (UNet [6], UNet++ [7], UNet3+ [8], etc.), and the DeepLab series [9,10,11]. These methods usually rely on a large-scale dataset with high-quality pixel-level annotations of COVID-19 lesions. The need for large-scale data collection and labeling before model training prevents their wide adoption in the context of public health.

The attention mechanism is a technique widely used in fields such as natural language processing (NLP), statistical learning, image processing, and speech recognition. It is inspired by the selective attention of human vision. The attention mechanism focuses on important information with high weights, ignores unrelated information with low weights, and continuously adjusts the weights so that important information can be selected in different situations, making it extensible and robust. Depending on the application context, attention can be grouped into spatial, channel, layer, mixed, and temporal domains. The spatial and channel domains are the most widely used in image processing tasks, and many excellent attention modules such as SENet [12], CBAM [13], and SKNet [14] have been proposed. To improve the accuracy of deep learning models for COVID-19 lesion segmentation, attention modules (or their variations) have been integrated into state-of-the-art segmentation networks. However, existing attention modules often cannot fully utilize the characteristics of CT images. Moreover, they often disrupt the original feature distribution of the input data, resulting in low segmentation accuracy and slow convergence of network training.

To address the abovementioned issues, we propose a novel design in which the lesion region of the CT image is treated as a rough spatial attention prior and then combined with a channel attention module, yielding a novel plug-and-play attention module, named the focal attention module (FAM), for lesion segmentation of COVID-19 CT images. The main contributions of this study are as follows:

  1. A novel spatial attention module is proposed. It introduces the shape prior of the lesion region to improve the feature analysis and weight redistribution of the attention module and accelerates the convergence of network training.

  2. We sequentially combine the spatial attention module, in the form of a residual branch, with the channel attention module, constructing the novel Focal Attention Module (FAM) for lesion segmentation of COVID-19 CT images.

  3. FAM is integrated into six state-of-the-art networks and validated on a public COVID-19 CT image dataset.

The rest of the paper is organized as follows: Sect. 2 describes the work related to the proposed method. Section 3 details the design and implementation of the proposed method. The experiments and discussion are described in Sect. 4. Finally, we conclude the study in Sect. 5.

2 Related work

In this section, we first discuss existing deep learning-based methods for lesion segmentation of COVID-19 CT images, followed by related work on the attention mechanism; finally, we introduce the applications of shape priors in image segmentation.

2.1 Lesion segmentation of COVID-19 CT images

Data annotation is usually labor-intensive and time-consuming, so large-scale segmentation datasets of COVID-19 lesions are rarely available. Meanwhile, training networks on a small-scale dataset suffers from issues such as over-fitting and poor generalization. Existing deep learning methods have been proposed to attenuate the models' reliance on a large-scale dataset. The attention mechanism is used to enhance the feature extraction capability of the network. For example, Fan et al. [15] combined a semi-supervised learning model and the FCN8s network with implicit reverse attention and explicit edge attention mechanisms to achieve a novel model, which attains a sensitivity of 72.5% and an accuracy of 96.0%. Chen et al. [16] proposed a residual attention UNet and applied a soft attention mechanism to enhance the feature learning capability of the model, achieving a segmentation accuracy of 89%. Zhao et al. [17] integrated their proposed spatial-wise and channel-wise attention modules into UNet++ [7]; the Dice Similarity Coefficient (DSC) of the model is 88.99%. A number of novel loss functions and special network modules have also been proposed. For example, Wang et al. [18] proposed a noise-robust Dice loss to solve the problem of poor training results caused by low-quality labels; the DSC of the model is 80.72%. Inspired by contrast enhancement methods and Atrous Spatial Pyramid Pooling (ASPP) [10], Yan et al. [19] proposed a Progressive Atrous Spatial Pyramid Pooling (PASPP) module to progressively aggregate information and obtain more useful contextual features; the DSC of the model is 72.60%. Elharrouss et al. [20] proposed a multi-class segmentation network based on an encoder-decoder structure, whose multi-input stream allows the model to learn more features; it achieves a sensitivity of 71.1%. In addition, multi-scale feature fusion [21], multipoint supervised training [22], and conditional generation models [23] are promising for improving the segmentation accuracy of COVID-19 lesions.

2.2 Attention mechanism

Attention is an essential and complex cognitive function of the human brain [24]. With attention, people can work methodically while receiving large amounts of information through vision, hearing, touch, etc. The human brain can select small portions of interesting information from this large input to focus on, while ignoring the other portions.

In the context of computer vision, attention can be divided into soft attention and hard attention [25]. Soft attention computes attention weights over all the data and sets no filtering condition on the data features. Hard attention sets a filtering condition after computing the attention weights and forces the attention weights that do not meet the condition to 0. In contrast, soft attention is probabilistic and end-to-end differentiable, and it learns the attention weights through forward- and back-propagation without posterior sampling. There are a number of studies on soft attention. Inspired by the translation and rotation invariance of the pooling mechanism, Jaderberg et al. [26] proposed a spatial transformer module that learns the transformation from the network; it has been widely used for Optical Character Recognition (OCR). Hu et al. [12] proposed a channel attention module (SENet), but SENet cannot capture spatial contextual information. Woo et al. [13] extended SENet and proposed an attention module (CBAM) that constrains and enhances the input feature map along the channel and spatial dimensions. However, the spatial attention module of CBAM fails to capture information at different scales and cannot establish long-range dependencies. Inspired by the classical non-local means method [27] in image processing, Wang et al. [28] proposed an attention module (non-local neural networks) for capturing long-range dependencies. Fu et al. [29] combined the advantages of CBAM and non-local neural networks to propose DANet, an attention module widely used in semantic segmentation. Drawing on the idea of residual networks, Wang et al. [30] proposed a novel solution to the information reduction caused by stacked attention modules. Criss-Cross Attention was proposed by Huang et al. [31] to reduce the computation of non-local neural networks. Gao et al. [32] proposed a Spatially Modulated Co-Attention (SMCA) mechanism to accelerate training convergence, but it suffers from increased computation and inference time. A particular channel attention module [33] was proposed to distinguish the esophagus and surrounding tissues in esophageal cancer. In contrast, the literature on hard attention is limited, and studies [34,35,36] argue that reinforcement learning is required to train hard attention due to its non-differentiability.

Although a number of studies [16, 17, 37, 38] have introduced the attention mechanism for lesion segmentation of COVID-19 CT images, further improvements in the performance and accuracy of these models are still urgently expected in academia and industry.

2.3 Shape priors in image segmentation

Traditional segmentation methods (e.g., thresholding, watershed, and region growing) usually suffer from a lack of robustness and poor segmentation accuracy due to noise, low contrast, and the complexity of objects in medical images. Recently, the rapid development of deep learning has promoted the adoption of deep learning-based segmentation algorithms in medical imaging. Studies [39,40,41,42,43] have shown that integrating prior knowledge of objects into the segmentation formulation can improve the segmentation accuracy for a specific target. Prior knowledge has been utilized in various forms, e.g., user interaction, object shape, and appearance [44].

Shape is one of the most important geometric attributes of anatomical objects, and shape priors can reduce the search space of potential segmentation outputs for deep learning models [45]. Ravishankar et al. [46] incorporated a shape model explicitly into FCN through a novel loss function that penalizes the deviation of the predicted segmentation mask from a learned shape model. Avanti et al. [47] used stacked auto-encoders to infer the target shape, which is then incorporated into deformable models to improve accuracy and robustness. In addition, Ngo et al. [48] and Cremers et al. [40] combined level sets and genetic algorithms with deep learning to improve training on small datasets. Zhao et al. [49] obtained a shape prior of the lung region through threshold segmentation to optimize lung segmentation.

3 Proposed method

3.1 Design rationale of focal attention module

Given an intermediate feature map \(\mathbf{F}\in {\mathbb{R}}^{C\times H\times W}\) and an input CT image \(\mathbf{I}\in {\mathbb{R}}^{1\times H\times W}\), FAM sequentially infers a 1D channel attention map \({\mathbf{M}}_{c}\in {\mathbb{R}}^{C\times 1\times 1}\) and a 2D spatial attention map \({\mathbf{M}}_{s}\in {\mathbb{R}}^{1\times H\times W}\), as illustrated in Fig. 1. The attention module is formulated as:

$${\mathbf{F}}^{\prime } = {\mathbf{M}}_{{\text{c}}} \left( {\mathbf{F}} \right) \otimes {\mathbf{F}},$$
(1)
$${\mathbf{F}}^{\prime \prime } = \left( {{\mathbf{M}}_{{\text{s}}} \left( {\mathbf{I}} \right) + 1} \right) \otimes {\mathbf{F}}^{\prime } ,$$
(2)

where \(\otimes\) denotes element-wise multiplication. During multiplication, the attention values are broadcast accordingly: channel attention values are broadcast along the spatial dimensions, and spatial attention values along the channel dimension. \({\mathbf{F}}^{\prime\prime}\) is the final refined output. Different from naively stacked attention modules (e.g., CBAM, SENet), the feature map refined by the spatial attention module (as depicted in Fig. 2) is combined as a residual branch with the feature map refined by the channel attention module, based on the following analysis (a minimal code sketch of this combination follows the list below):

Fig. 1
figure 1

Overview of the focal attention module, which consists of a channel attention module and a spatial attention module

Fig. 2
figure 2

a The channel attention module: max-pooling and average-pooling outputs forwarded to a multi-layer perceptron; b the spatial attention module obtains a rough shape prior of the lesion region by median filtering and distance transformation

  1. When the input \(\mathbf{I}\) is a negative sample, the spatial attention obtained from its lung image \(\mathbf{L}\) by distance transformation contains little feature information. In this case, naive stacking degrades the value of the features in deep layers.

  2. The residual branch works as a feature selector that enhances good features and suppresses noise in the trunk features.

  3. Inspired by the Residual Attention Network [38], attention residual learning not only keeps the good properties of the original features but also allows them to be refined by the spatial attention module.
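The following is a minimal PyTorch sketch of this design, assuming the `ChannelAttention` and `SpatialAttention` modules sketched in Sects. 3.2 and 3.3; class and variable names are illustrative rather than the authors' released implementation:

```python
import torch
import torch.nn as nn

class FAM(nn.Module):
    """Focal attention module sketch: channel attention applied
    multiplicatively (Eq. 1), then the shape-prior spatial attention
    added as a residual branch (Eq. 2)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_att = ChannelAttention(channels, reduction)  # Sect. 3.2
        self.spatial_att = SpatialAttention()                     # Sect. 3.3

    def forward(self, feat: torch.Tensor, ct_image: torch.Tensor) -> torch.Tensor:
        # Eq. (1): F' = Mc(F) * F, broadcast over the spatial dimensions.
        feat = self.channel_att(feat) * feat
        # Eq. (2): F'' = (Ms(I) + 1) * F'; the "+1" preserves the trunk
        # features and adds the spatial attention as a residual branch.
        return (self.spatial_att(ct_image) + 1.0) * feat
```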

3.2 Channel attention module

The channel attention module focuses on “what” is meaningful in the given feature map. To compute the channel attention efficiently, the spatial information of the feature map is first aggregated by average-pooling and max-pooling operations, respectively, yielding two different spatial context descriptors (i.e., \({\mathbf{F}}_{\text{avg}}^{c}\) and \({\mathbf{F}}_{\max}^{c}\)). Both descriptors are then forwarded to a multi-layer perceptron (MLP) with one hidden layer, producing two output feature maps, which are finally merged by element-wise summation. To reduce the number of parameters, the hidden layer size is set to \({\mathbb{R}}^{C/r\times 1\times 1}\), where \(r\) is the reduction ratio. The channel attention is formulated as:

$$\begin{aligned} {\mathbf{M}}_{c} \left( {\mathbf{F}} \right) & = \sigma \left( {{\text{MLP}}\left( {{\text{AvgPool}}\left( {\mathbf{F}} \right)} \right) + {\text{MLP}}\left( {{\text{MaxPool}}\left( {\mathbf{F}} \right)} \right)} \right) \\ & = \sigma \left( {{\mathbf{W}}_{1} \left( {{\mathbf{W}}_{0} \left( {{\mathbf{F}}_{{\text{avg}}}^{c} } \right)} \right) + {\mathbf{W}}_{1} \left( {{\mathbf{W}}_{0} \left( {{\mathbf{F}}_{\max }^{c} } \right)} \right)} \right), \\ \end{aligned}$$
(3)

where \(\sigma\) represents the sigmoid function, \({\mathbf{W}}_{0}\in {\mathbb{R}}^{C/r\times C}\), and \({\mathbf{W}}_{1}\in {\mathbb{R}}^{C\times C/r}\). \({\mathbf{W}}_{0}\) and \({\mathbf{W}}_{1}\) are the MLP weights and are shared for both inputs. The ReLU activation function follows \({\mathbf{W}}_{0}\).
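As a minimal sketch, the channel attention of Eq. (3) can be written in PyTorch as follows; the shared MLP is implemented here with 1 × 1 convolutions, a common equivalent, and the names are illustrative:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention of Eq. (3): a shared MLP applied to the average-
    and max-pooled descriptors, summed, then passed through a sigmoid."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared weights W0 and W1; ReLU follows W0.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        avg_desc = self.mlp(feat.mean(dim=(2, 3), keepdim=True))  # F_avg^c
        max_desc = self.mlp(feat.amax(dim=(2, 3), keepdim=True))  # F_max^c
        return torch.sigmoid(avg_desc + max_desc)  # Mc(F), shape (C, 1, 1)
```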

3.3 Spatial attention module

The spatial attention module focuses on “where” the informative parts are in the given feature maps. The spatial attention in CBAM is learned by a 7 × 7 convolution layer, which has two shortcomings for the specific task of lesion segmentation in COVID-19 diagnosis: (1) an inefficient learning process; (2) the change that the spatial attention imposes on the feature space easily causes problematic training convergence and poor generalization, especially when the dataset is small or the backbone network has few parameters. Two mechanisms are introduced to address these issues: (1) adopting a residual structure that refines the feature maps with spatial attention while preserving the trunk network features (as shown in Fig. 1); (2) utilizing the shape prior of the COVID-19 lesion region to reduce the search space of the spatial attention module. The main steps in computing the spatial attention (as shown in Fig. 2b) are lung segmentation, median filtering, and distance transformation.

3.3.1 Lung segmentation

To efficiently obtain the shape prior of the lesion region, the lung must first be segmented from the CT image. Many excellent lung segmentation methods have been proposed and widely used; they mainly fall into three types: traditional image processing-based algorithms, deep learning-based algorithms, and combinations of the two. Because lung segmentation is not the focus of this paper, the lung region is segmented with a simple mask operation using the labels in the dataset.

3.3.2 Median filtering

Median filtering is introduced to eliminate some of the noise pixels, i.e., the pulmonary trachea and pulmonary vessels, from the lung image. Median filtering, a nonlinear method that preserves edge details while eliminating noise, has been widely used in fields such as image enhancement and image restoration. As shown in Fig. 3a, noise pixels in the lung region, such as the tiny pulmonary trachea and pulmonary vessels, interfere significantly with accurate lesion segmentation. As shown in Fig. 3b, median filtering eliminates most of the small pulmonary trachea and vessels, and it also shrinks the pixel regions of the larger ones. Meanwhile, median filtering retains the nature of the lesions (i.e., ground-glass opacity) with little reduction in the lesion area, because the lesions occupy large pixel regions.
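For example, this step can be performed with a standard library call; a minimal Python sketch, in which the kernel size and file name are illustrative assumptions:

```python
import numpy as np
from scipy import ndimage

# lung: a 2D array of the segmented lung region (background zeroed out);
# the file name is hypothetical.
lung = np.load("lung_slice.npy")

# A small median kernel removes tiny trachea/vessel pixels while leaving
# the large ground-glass lesion regions almost unchanged.
denoised = ndimage.median_filter(lung, size=5)
```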

Fig. 3
figure 3

The process of eliminating noise pixels in the lung region of a CT image step by step. a Many noise pixels (i.e., pulmonary trachea and pulmonary vessels inside the lung region). b Median filtering partially eliminates the noise pixels. c Distance transformation further eliminates the noise pixels and extracts the main lesion region

3.3.3 Distance transformation

Distance transformation (DT) converts a digital binary image consisting of object and non-object pixels into another image in which each object pixel has a value corresponding to its minimum distance from the non-object pixels under a distance function [50, 51]. Distance transformation is widely used for target thinning and object skeleton extraction. Euclidean distance, city-block distance, and chessboard distance are widely used distance measures. The full workflow of the distance transformation is as follows:

Given an image \(J\), it is binarized to obtain an image \({J}_{b}\). In \({J}_{b}\), 1 is associated with object pixels and 0 with background pixels. Hence, we have a pixel set \(\mathcal{O}\) of all the object pixels and \({\mathcal{O}}^{c}\) of all the background pixels:

$${\mathcal{O}} = \left\{ {t\left| {J_{b} \left( t \right)} \right. = 1} \right\},$$
(4)
$$\begin{array}{*{20}c} {{\mathcal{O}}^{c} = \left\{ {b\left| {J_{b} \left( b \right)} \right. = 0} \right\}} \\ \end{array} ,$$
(5)

where \(t\) and \(b\) represent object and background pixels, respectively. The distance transformation (\(DT\)) generates a map \(D\), in which the value of each pixel in \(\mathcal{O}\) is the smallest distance from that pixel to \({\mathcal{O}}^{c}\):

$$D\left( t \right) = \min \left\{ {d\left( {t,b} \right)\left| {\;b \in {\mathcal{O}}^{c} } \right.} \right\},\quad t \in {\mathcal{O}},$$
(6)

where the image \(D\) is called the distance map of \(J\). It is assumed that \(\mathcal{O}\) contains at least one pixel; otherwise, the output of the \(DT\) is undefined, i.e., such cases are ignored in the distance transformation. Moreover, \(d(t,b)\) represents the Euclidean distance, formulated as:

$$d\left( {t,b} \right) = \sqrt {\left( {t_{x} - b_{x} } \right)^{2} + \left( {t_{y} - b_{y} } \right)^{2} } ,\quad 0 \le x < W,\;\; 0 \le y < H,$$
(7)

where \(H\) and \(W\) represent the height and width of the image \(J\) respectively.

As shown in Fig. 3c, the distance transformation is used to eliminate the remaining noise pixels (i.e., the pulmonary trachea and pulmonary vessels) and extract the main lesion region. Figure 4 shows the distribution of connected regions in a lung image after sequential median filtering and distance transformation. Connected regions containing more than 200 pixels represent the lesion region, while those with a small area correspond to the pulmonary trachea and pulmonary vessels. After the distance transformation, the distribution of connected regions in the lung image is close to the ground truth. Distance maps of several lung images are shown in Fig. 5; by comparing a distance map with the corresponding lesion label, the main lesion region is extracted.
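A sketch of this step with SciPy, continuing from the `denoised` lung image above; the 200-pixel threshold is taken from the analysis of Fig. 4:

```python
import numpy as np
from scipy import ndimage

# Jb of Eqs. (4)-(5): 1 marks object pixels, 0 marks background.
binary = (denoised > 0).astype(np.uint8)

# Euclidean distance transform (Eqs. 6-7): each object pixel receives its
# distance to the nearest background pixel.
dist_map = ndimage.distance_transform_edt(binary)

# Keep connected regions larger than 200 pixels as lesion candidates;
# smaller ones correspond to residual trachea/vessel pixels.
labels, num = ndimage.label(binary)
sizes = ndimage.sum(binary, labels, index=range(1, num + 1))
lesion_mask = np.isin(labels, np.flatnonzero(sizes > 200) + 1)
```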

Fig. 4
figure 4

The distributions of connected regions in a lung image without and with sequential median filtering, and with distance transformation applied

Fig. 5
figure 5

Distance maps of several lung images: a lung region is segmented from CT images in the dataset. b Distance maps of lung images obtained by distance transformation. c By comparing the distance map with the corresponding lesion label, the main lesion region is extracted

Normalization serves as an activation function to obtain the shape prior of the lesion region. As shown in Fig. 6b, the distance transformation strengthens the weights of the main connected regions and weakens the weights of the edge parts (shown in Fig. 6a). Furthermore, as shown in Fig. 6c, normalization removes both the edges and the connected regions with a small area. The normalization function \(Norm\) is formulated as:

$${\text{Norm}}\left( X \right) = \frac{{X - X_{\min } }}{{X_{\max } - X_{\min } }},$$
(8)

where \(X\) represents an image matrix, and \({X}_{\min}\) and \({X}_{\max}\) represent the minimum and maximum values in \(X\), respectively. Next, as illustrated in Fig. 2b, two 7 × 7 convolution layers are utilized to learn the attention weights of edge pixels, which have low boundary contrast with their surroundings, and to adaptively tune the attention weights of the other regions. Finally, the refined spatial attention is obtained. The spatial attention is formulated as:

$${\mathbf{M}}_{s} \left( {\mathbf{I}} \right) = \sigma \left( {f_{7 \times 7}^{{{\text{cov}}}} \left( {f_{7 \times 7}^{{{\text{cov}}}} \left( {{\text{Norm}}\left( {{\text{DT}}\left( {f^{{\text{med }}} \left( {f^{{{\text{seg}}}} \left( {\mathbf{I}} \right)} \right)} \right)} \right)} \right)} \right)} \right),$$
(9)

where \(\mathbf{I}\) is the input CT image, \(\sigma\) represents the sigmoid function, and \({f}^{\text{seg}}\) and \({f}^{\text{med}}\) represent the lung segmentation and median filtering, respectively. \({f}_{7\times 7}^{\text{cov}}\) is a 7 × 7 convolution layer, and \(\text{DT}\) represents the distance transformation.
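Putting the steps together, a minimal PyTorch sketch of the spatial attention of Eq. (9), assuming the lung segmentation \(f^{\text{seg}}\) is done beforehand (the paper uses the dataset's lung labels) and with an illustrative median kernel size:

```python
import numpy as np
import torch
import torch.nn as nn
from scipy import ndimage

class SpatialAttention(nn.Module):
    """Shape-prior spatial attention of Eq. (9): median filtering, distance
    transformation and normalization, followed by two 7 x 7 convolutions
    and a sigmoid."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=7, padding=3, bias=False),
            nn.Conv2d(1, 1, kernel_size=7, padding=3, bias=False),
        )

    @staticmethod
    def shape_prior(lung: np.ndarray) -> np.ndarray:
        """f_med, DT and Norm (Eq. 8) applied to one 2D lung image."""
        filtered = ndimage.median_filter(lung, size=5)            # f_med
        dist = ndimage.distance_transform_edt(filtered > 0)       # DT
        span = dist.max() - dist.min()
        return (dist - dist.min()) / span if span > 0 else dist   # Norm

    def forward(self, lung: torch.Tensor) -> torch.Tensor:
        # lung: (N, 1, H, W) segmented lung images; the prior is computed
        # per slice on the CPU, then refined by the convolutions.
        prior = torch.stack([
            torch.from_numpy(self.shape_prior(x[0].cpu().numpy())).float()
            for x in lung
        ]).unsqueeze(1).to(lung.device)
        return torch.sigmoid(self.conv(prior))   # Ms(I), shape (1, H, W)
```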

Fig. 6
figure 6

A numerical example of distance transformation: a a binary image containing several connected regions; b the distance map of (a); c the normalization of (b)

4 Experiment

4.1 Dataset and metrics

The dataset contains 20 groups of labeled CT scans of COVID-19 patients. Two radiologists manually labeled the regions of the left lung, right lung, and COVID-19 lesions on the images, and the labeled regions were further verified by another radiologist. CT slices are exported from the CT scans as 2D images (752 × 670 pixels); a group of processed images is shown in Fig. 7. The dataset contains a total of 3520 CT images and is divided into a training set and a testing set at a ratio of 4:1 (as detailed in Table 1). Note that “positive” denotes CT images with lesions and “negative” denotes healthy CT images.

Fig. 7
figure 7

The dataset contains segmentation labels for the left lung, right lung and COVID-19 lesions. The lung region is segmented based on the lung label

Table 1 Details of the dataset

The Dice Similarity Coefficient (DSC), false negatives (FN), false positives (FP), and inference time of the networks are adopted as evaluation metrics. DSC is a standard metric for the pixel-wise comparison between the ground truth and the predicted segmentation. It is formulated as follows:

$${\text{DSC}}\left( {A,B} \right) = \frac{{2\left| {A \cap B} \right|}}{\left| A \right| + \left| B \right|},$$
(10)

where \(A\) is the lesion label, and \(B\) denotes the segmented lesion image.
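A one-function sketch of this metric for binary masks; the convention of returning 1.0 when both masks are empty is our assumption:

```python
import numpy as np

def dsc(label: np.ndarray, pred: np.ndarray) -> float:
    """Dice similarity coefficient of Eq. (10) for binary masks A and B."""
    a, b = label.astype(bool), pred.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0
```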

4.2 Training method

An Adam optimizer with a learning rate of 0.001 is employed to minimize the binary cross-entropy (BCE) loss. The number of training epochs and the batch size are set to 100 and 1, respectively. The network weights are initialized with Kaiming initialization, and the network biases are initialized to 0. Moreover, positive and negative samples are trained alternately, and the dataset is shuffled at each iteration. The BCE loss is formulated as follows:

$${\text{Loss}}_{{{\text{BCE}}}} = - \left[ {G\log \left( P \right) + \left( {1 - G} \right)\log \left( {1 - P} \right)} \right],$$
(11)

where \(G\) is the label and \(P\) is the output of the network.
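A sketch of this training configuration in PyTorch; `model` and `loader` are placeholders for any of the networks and a data loader over the dataset described above:

```python
import torch
import torch.nn as nn

def init_weights(m: nn.Module) -> None:
    # Kaiming initialization for weights, zeros for biases (Sect. 4.2).
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def train(model: nn.Module, loader) -> None:
    """Train with the settings of Sect. 4.2: Adam (lr = 0.001), BCE loss,
    100 epochs, batch size 1; `loader` is assumed to shuffle the dataset
    and alternate positive and negative samples."""
    model.apply(init_weights)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.BCELoss()              # Eq. (11)
    for epoch in range(100):
        for image, label in loader:
            optimizer.zero_grad()
            loss = criterion(model(image), label)
            loss.backward()
            optimizer.step()
```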

4.3 Ablation analysis

FAM can be conveniently integrated into any CNN as a lightweight, plug-and-play attention module. As shown in Fig. 8, FAM is integrated between convolution layers to refine the intermediate feature map. FAM and two state-of-the-art attention modules (i.e., SENet [12] and CBAM [13]) are respectively inserted between every two convolution layers of six state-of-the-art segmentation networks [6,7,8,9,10,11, 16, 21] for the ablation experiments. All the networks and modules are reproduced in the PyTorch framework and trained and inferenced on a single NVIDIA GeForce RTX 2080Ti GPU with CUDA v10.2 and cuDNN v7.6.5. The main concerns of this ablation analysis are the shape prior, the time complexity, the network performance, and the convergence rate of network training.

Fig. 8
figure 8

The structure of the integration of FAM with the network
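An illustrative sketch of this integration pattern, a pair of convolution layers followed by a FAM; the block name is hypothetical, and resizing the CT image or prior to the feature resolution is omitted for brevity:

```python
import torch
import torch.nn as nn

class ConvBlockWithFAM(nn.Module):
    """Two convolution layers followed by a FAM (Sect. 3.1) that refines
    the intermediate feature map, as in Fig. 8."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.fam = FAM(out_ch)

    def forward(self, x: torch.Tensor, ct_image: torch.Tensor) -> torch.Tensor:
        return self.fam(self.convs(x), ct_image)
```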

4.3.1 Shape prior

FAM is constructed based on the shape prior and the spatial attention module of CBAM. CBAM and FAM are respectively integrated between the convolution layers of SegNet (as shown in Fig. 9), and the resulting spatial attention is visualized in Fig. 10 to illustrate the effect of the shape prior. Since SegNet is a typical encoder-decoder structure in which low-level features and high-level semantic features are extracted in turn, the feature maps in the middle layers are abstract. CBAM inherits this feature extraction workflow because of the connection between its spatial attention module and the network, i.e., the input of its spatial attention module comes from inside the network rather than outside. Thus, the spatial attention of CBAM integrated into the middle layers is abstract, and that integrated into the other layers highlights only the lung region rather than the lesion region, resulting in inaccurate spatial attention learning for CBAM. In FAM, the shape prior of the lesion region is introduced into the spatial attention module to reduce the search space. In summary, FAM focuses on the lesion region without being disturbed by non-lesion regions during spatial attention learning.

Fig. 9
figure 9

The workflow of attention modules integrated with SegNet

Fig. 10
figure 10

Spatial attention comparison between CBAM and FAM in SegNet. We visualize the spatial attention of the sixteen attention modules of SegNet when using CBAM and FAM. c is the lung image segmented from the input CT image; d and e are the corresponding lesion label and shape prior

4.3.2 Time complexity

Time complexity determines the training and inference time of a network; networks with high time complexity suffer from poor real-time performance. Floating-point operations (FLOPs) are a classical metric of time complexity. The time complexity of a network is formulated as follows:

$$T = O\left( {\sum\limits_{i = 1}^{l} {f_{{{\text{FLOPs}}}} \left( {{\text{layer}}} \right)} } \right),$$
(12)

where \(l\) denotes the number of layers in the network, and \({f}_{\text{FLOPs}}\) represents the function that calculates the FLOPs of a layer. The FLOPs for each type of layer are formulated as:

$$f_{{{\text{FLOPs}}}}^{{{\text{conv}}}} = 2C_{{{\text{in}}}} C_{{{\text{out}}}} K_{w} K_{h} Q_{w} Q_{h} ,$$
(13)
$$f_{{{\text{FLOPs}}}}^{{{\text{linear}}}} = 2N_{{{\text{in}}}} N_{{{\text{out}}}} ,$$
(14)
$$f_{{{\text{FLOPs}}}}^{{{\text{pooling}}}} = C_{in} Q_{w} Q_{h} ,$$
(15)
$$f_{{{\text{FLOPs}}}}^{{{\text{relu}}}} = C_{in} Q_{w} Q_{h} ,$$
(16)
$$f_{{{\text{FLOPs}}}}^{{{\text{sigmoid}}}} = 4C_{in} Q_{w} Q_{h} ,$$
(17)

where \({f}_{\mathrm{FLOPs}}^{\mathrm{conv}}\), \({f}_{\mathrm{FLOPs}}^{\mathrm{linear}}\), \({f}_{\mathrm{FLOPs}}^{\mathrm{pooling}}\), \({f}_{\mathrm{FLOPs}}^{\mathrm{relu}}\) and \({f}_{\mathrm{FLOPs}}^{\mathrm{sigmoid}}\) represent the functions that calculate the FLOPs of the convolution layer, fully connected layer, global pooling layer, ReLU layer, and sigmoid layer, respectively. These five types of layers are used in FAM, SENet, and CBAM. \({C}_{in}\) and \({C}_{out}\) denote the numbers of channels of the input and output feature maps, \({K}_{w}\) and \({K}_{h}\) denote the width and height of the convolution kernel, \({Q}_{w}\) and \({Q}_{h}\) denote the width and height of the input feature map, and \({N}_{\mathrm{in}}\) and \({N}_{\mathrm{out}}\) denote the numbers of input and output neurons. We set the number of channels of the input feature map to 16 and the reduction ratio of the channel attention module to 16; the size of the input feature map is 652 × 752. The FLOPs of FAM, SENet, and CBAM are listed in Table 2. SENet has the smallest FLOPs because it lacks a spatial attention module. FAM and CBAM share a similar structure, but the FLOPs of FAM are lower than those of CBAM by about 15 million, because FAM has fewer pooling layers. As shown in Table 3, the FLOPs of the six state-of-the-art segmentation networks increase very little when FAM is integrated between every two convolution layers. Moreover, the inference time of FAM is 4 ms shorter than that of CBAM on average. In conclusion, FAM greatly improves the performance of the network with very little increase in time complexity.
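These per-layer costs are simple arithmetic; a small sketch, using the feature map size quoted above:

```python
def conv_flops(c_in, c_out, k_w, k_h, q_w, q_h):
    """Eq. (13): convolution layer."""
    return 2 * c_in * c_out * k_w * k_h * q_w * q_h

def linear_flops(n_in, n_out):
    """Eq. (14): fully connected layer."""
    return 2 * n_in * n_out

def pool_or_relu_flops(c_in, q_w, q_h):
    """Eqs. (15)-(16): global pooling or ReLU layer."""
    return c_in * q_w * q_h

def sigmoid_flops(c_in, q_w, q_h):
    """Eq. (17): sigmoid layer."""
    return 4 * c_in * q_w * q_h

# Example: one 7 x 7 convolution of the spatial attention module applied
# to a single-channel 652 x 752 prior map.
print(conv_flops(1, 1, 7, 7, 752, 652))  # about 48 million FLOPs
```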

Table 2 FLOPs of FAM, SENet and CBAM
Table 3 Performance comparison of FAM, SENet and CBAM on state-of-the-art segmentation network

4.3.3 Performance analysis of the networks with attention modules

The average DSC, FN, and FP over all images in the test set are adopted to analyze the performance of the networks integrated with the attention modules (as detailed in Table 3). FAM adds almost no extra parameters to the network, as it has only one more 7 × 7 convolution than CBAM. The integration of FAM improves the DSC of all these networks, with an improvement of 2% for SegNet. In addition, the integration of FAM reduces the FN and FP, with a reduction of 17.6% for PSPNet.

SENet improves the performance of UNet, PSPNet, and UNet++, while degrading the performance of FCN, SegNet, and DeepLabV3+. CBAM improves the performance of SegNet and PSPNet, while degrading the performance of FCN, UNet, DeepLabV3+, and UNet++. This finding shows that the effectiveness of integrating an attention module depends on the structure of the network.

Figure 11 shows the segmentation results of SegNet integrated with different attention modules on the dataset. As shown in Fig. 11, SegNet without an attention module suffers from false detections in the lesions of both the left and right lobes to some extent. The integration of SENet alleviates this, whereas CBAM exacerbates it. Although SegNet integrated with FAM still produces some false detections, it achieves the highest overall segmentation accuracy.

Fig. 11
figure 11

Comparison of the segmentation results on lesions of COVID-19 CT images with various combinations of SegNet and attention modules

4.3.4 Convergence analysis of the network training

The six networks with attention modules are trained with the same experimental settings and dataset. As shown in Fig. 12, all networks converge within 60 iterations. FAM accelerates model training better than SENet or CBAM does. For SegNet, UNet++, and DeepLabV3+, SENet does not significantly accelerate model training. For UNet and UNet++, CBAM decelerates the convergence of model training. In contrast, FAM accelerates the training of FCN, UNet, SegNet, PSPNet, and DeepLabV3+ while also minimizing the loss.

Fig. 12
figure 12

Training diagrams of the networks integrated with various attention modules. a-c and d-f show that FAM accelerates the network training as well as minimizing the loss

For UNet++, FAM performs better than the other two attention modules. For UNet and PSPNet, FAM makes the training process more stable. The results show that FAM achieves a better convergence rate and a lower converged loss value than CBAM across these six networks. In addition, although SENet applies attention only in the channel dimension, it achieves better performance than CBAM in specific networks such as FCN, UNet, and PSPNet.

5 Conclusion

In this study, a lightweight, plug-and-play attention module is proposed to improve the lesion segmentation performance of CNNs on COVID-19 CT images. FAM refines the input feature map along the channel and spatial dimensions to maximize the representation ability of the network. In the spatial attention of FAM, a shape prior of the lesion region is used to reduce the search space for attention learning. In addition, the feature map refined by the spatial attention is added to the network as a residual branch. A set of experiments demonstrated that: (1) FAM improves the segmentation performance on a small-scale public COVID-19 CT image dataset; (2) FAM accelerates the convergence of model training; (3) FAM can be stacked in a deep segmentation network without performance loss; and (4) FAM achieves better real-time performance.

FAM is promising for practical use in public health. In the future, we will work on improving the generated shape prior to enhance the generalization performance of FAM on up-to-date COVID-19 CT image datasets.