1 Introduction

Image segmentation is one of the main research areas in medical image analysis; it attempts to assign a label to each pixel and thus addresses pixel-wise lesion recognition [9]. The morphological properties of segmentation outcomes, such as shape, size and area, usually provide significant cues for the early manifestations of many malignant diseases. Imaging techniques such as computed tomography (CT), magnetic resonance imaging (MRI) and microscopy, which provide an intuitive and effective way to examine various diseases, have been widely utilized in daily clinical diagnosis and treatment planning [36]. Segmentation of the objects of interest in these images, for example, skin lesion segmentation in dermoscopy images [15], lung segmentation in CT images [27] and colorectal cancer segmentation in endoscopy images [31], is a fundamental step in accurately extracting the features needed for computer-aided diagnosis (CAD) systems, which can assist clinicians by reducing the time, cost and error of manual processing in clinical practice [3].

With the rapid development and wide adoption of medical imaging technologies, large numbers of medical images are collected and available for analysis. It is urgent to develop automatic algorithms that analyze these images efficiently and objectively, so as to provide doctors with precise interpretation of the diagnostic information they contain and enable better treatment of a large number of patients [3]. However, automatic and accurate segmentation of lesions (tissues or organs) in medical images remains a challenging task. First, the morphological appearance of the target objects varies greatly among individuals, even for the same disease, which increases the difficulty of segmentation. Figure 1 illustrates three examples for lung nodules, skin lesions and colorectal polyps. Second, the contrast between the target objects and the background is often low, which complicates segmentation. In particular, different tissues and organs are often included in the regions of interest, which makes the segmentation of these ambiguous boundaries more difficult. Third, texture variations, artifacts and imaging noise also pose great challenges to segmentation.

Fig. 1

Examples of several representative medical images. The first row shows lung nodules in CT images, the second row skin lesions in dermoscopy images, and the last row colorectal polyps in endoscopy images

In the past few decades, segmentation of medical images has received much attention, and a large number of accurate and automatic methods have been proposed [23, 34, 38]. The earlier methods are mainly based on traditional hand-crafted features [5, 7, 12, 26, 39, 41]. According to the type of feature, these methods can be roughly divided into three groups: gray-level based, texture based, and level-set or atlas based. Although these methods achieved promising performance at the time, they are unreliable in real, complex clinical situations because they depend heavily on pre-processing, which has low robustness to image quality and artifacts. Following the great success of deep learning (DL) in the field of computer vision, many DL variants have been proposed and applied to medical image segmentation [23, 33, 37]. The representative Unet is the most popular choice and usually obtains good results [30]. Its architecture consists of an encoding path that captures context features and a decoding path that enables precise localization, and it can be trained end to end from very few images. Although many variants of Unet have been proposed and widely used in medical image segmentation, they still suffer from inaccurate object boundaries and unsatisfactory results [1,2,3, 25, 32, 35, 40]. It is well known that discriminative features heavily affect segmentation performance. In order to segment the target objects accurately, many researchers have therefore focused on extracting and aggregating high-level context features and low-level fine details simultaneously.

In this manuscript, we propose a contour-aware semantic segmentation network for medical image segmentation. It is an extension of Unet with two branches: a semantic branch and a detail branch. The semantic branch follows the classical encoding–decoding structure of Unet and focuses on extracting semantic features from shallow and deep layers. The detail branch is designed to enhance the detailed contour information implied in the shallow layers. In addition, inspired by densely connected convolution, we design a MultiBlock module to replace the convolutional blocks of the classical Unet in order to exploit different receptive fields. We also add a spatial attention module between the encoding and decoding paths to suppress redundant features and improve the network’s representation capability. The contributions of this work are summarized as follows:

  • We propose a two-branch Convolutional Neural Network architecture containing a semantic branch and a detail branch, which aggregates the high-level semantic features and low-level fine details simultaneously.

  • We design a MultiBlock module which utilizes three paths with different receptive fields to extract semantic information from the input feature maps.

  • In comparison with the state-of-the-art methods, the proposed method achieves a remarkable performance on four public medical image segmentation challenges.

The remainder of this manuscript is organized as follows: Sect. 2 reviews relevant works of medical image segmentation; Sect. 3 presents our proposed segmentation neural network; Sect. 4 presents the datasets and experimental results; Sect. 5 concludes this paper.

2 Related work

Semantic segmentation is one of the most crucial tasks in the field of medical image analysis. Earlier literature mainly utilized traditional handcrafted features for semantic segmentation. With the fast development of deep learning, DL-based methods have achieved outstanding results and now dominate this task. Among these methods, convolutional neural networks (CNNs) and fully convolutional networks (FCNs) are the most popular segmentation frameworks. In this section, we briefly review CNN-based and FCN-based methods for medical image segmentation.

2.1 CNN-based segmentation frameworks

To our knowledge, Ciresan et al. [13] first utilized a deep CNN to segment electron microscopy images. They classified each pixel of every slice by extracting a patch around the pixel with a sliding window. Because the sliding-window approach involves heavy overlap and redundant computation, this method is time inefficient. Pereira et al. [29] used intensity normalization as a pre-processing step and small kernels to enable a deeper CNN architecture and reduce over-fitting for brain tumor segmentation in magnetic resonance images (MRI). In order to extract context information, Chen et al. [8] proposed a segmentation framework named DeepLab, which replaces all fully connected layers with convolutional layers and uses atrous convolutional layers to increase the feature resolution. Choudhury et al. [10] applied the DeepLab network to the task of brain tumor segmentation in MRI. Based on residual learning, Li et al. [21] proposed a dense deconvolutional network for skin lesion segmentation, which combines global contextual information via dense deconvolutional layers, chained residual pooling and auxiliary supervision to obtain multi-scale features.

2.2 FCN-based segmentation frameworks

The CNN-based methods make patch-wise predictions from a patch around each pixel, and thus discard the spatial information implied in the image when the convolutional features are fed into the fully connected (fc) layers [3]. To overcome this problem, Long et al. [24] proposed the fully convolutional network (FCN), which replaces all fc layers in the CNN architecture with convolutional and deconvolutional layers. FCN is trained end to end, pixels to pixels, for semantic segmentation, and it is among the most popular methods for segmenting medical and biomedical images. Christ et al. [11] proposed a liver segmentation method that cascades two FCNs, where the first FCN predicts the region of interest (ROI) of the liver and the second FCN segments the liver lesions within the predicted ROIs. Zhou et al. [42] proposed a Focal FCN which applies the focal loss on a fully weighted FCN for medical image segmentation, with the aim of addressing the limited training data for small objects by adding weights to the background and foreground losses.

Unet [30] has become one of the most popular FCNs and has been widely used in biomedical image segmentation since 2015. Unet utilizes an encoder–decoder architecture: the encoding path consists of several convolution and pooling operations, and the decoding path utilizes up-sampling operations to restore the original image resolution and produce segmentation results. The shortcut connections between layers of equal resolution enable Unet to utilize global location and context information at the same time. In addition, it works well on limited training images [4]. Oktay et al. [28] proposed an Attention Unet which employs a novel attention gate to automatically focus on target structures of varying shapes and sizes. In order to utilize the strengths of different networks, Alom et al. [1] proposed the Recurrent Convolutional Neural Network (RU-Net) and the Recurrent Residual Convolutional Neural Network (R2U-Net) based on Unet for medical image segmentation. Azad et al. [3] proposed an extension of Unet, the Bi-Directional ConvLSTM Unet with densely connected convolutions (BCDU), which combines bi-directional ConvLSTM and the dense convolution mechanism with Unet for medical image segmentation.

Recently, several studies have focused on using the detail information implied in shallow feature maps to enhance boundary information [18]. For example, Wang et al. [36] proposed a boundary-aware context neural network (BA-Net), which employs a pyramid edge extraction module, a mini multi-task learning module and an interactive attention module to capture context information for 2D medical image segmentation. Zhou et al. [43] first presented a nested Unet architecture named Unet++, which utilizes a series of nested and dense skip pathways to connect the encoder and decoder subnetworks. They then redesigned the skip connections in Unet++ by aggregating features of varying semantic scales in the decoders and devised a pruning scheme to accelerate its inference speed [44].

3 Method

The flow chart of the proposed network is illustrated in Fig. 2. Our model is an end-to-end trainable medical image segmentation network with two branches: a semantic branch that obtains high-level semantic context information and a detail branch that enhances low-level detail information. We introduce the details of our method below.

Fig. 2

Flowchart of the proposed method. The structure includes: (1) a detail branch with wide channels and shallow layers, used to capture low-level details and generate high-resolution feature representations; (2) a semantic branch with narrow channels and deep layers, used to obtain high-level semantic context information

3.1 Semantic branch

The semantic branch is shown in the top part of Fig. 2. This branch has narrow channels and deep layers, with the aim of capturing high-level semantic information of the image. Its architecture is an extension of Unet and likewise uses the encoding–decoding structure; the difference is that we employ the MultiBlock module and a spatial attention module in place of the original convolutional filters of the classical Unet. The encoding path consists of five steps. The first step contains a \(1\times 1\) convolutional layer and a MultiBlock module, and each of the next four steps is composed of a MultiBlock module. After the MultiBlock module in each step, there is a \(2\times 2\) max-pooling layer for down-sampling. The resolutions of the outputs of the steps in the encoding path are \(256\times 256\), \(128\times 128\), \(64\times 64\), \(32\times 32\) and \(16\times 16\), respectively. The encoding path is then followed by a spatial attention module to suppress useless redundant features. The number of channels in each layer of the semantic branch is listed in Table 1. The decoding path has four steps; each step concatenates the feature map obtained by up-sampling the output of the previous layer with the feature map of the same resolution copied from the encoding path.

Table 1 Number of channels in each layer of the semantic branch
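For concreteness, the following minimal Keras sketch reproduces the encoding–decoding skeleton of the semantic branch. The `block` helper is only a plain double-convolution stand-in for the MultiBlock module (sketched in Sect. 3.2), and the channel widths, activations and single-channel input are our assumptions rather than the exact values of Table 1.

```python
from tensorflow.keras import layers, Model

def block(x, filters):
    # Stand-in for the MultiBlock module: a plain double 3x3 convolution.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

inputs = layers.Input((256, 256, 1))
x = layers.Conv2D(16, 1, padding="same")(inputs)   # 1x1 convolution of the first step

# Encoding path: five steps with outputs at 256, 128, 64, 32 and 16 pixels.
skips, filters = [], 16
for _ in range(4):
    x = block(x, filters)
    skips.append(x)                                # copied to the decoding path
    x = layers.MaxPooling2D(2)(x)                  # 2x2 max pooling
    filters *= 2
x = block(x, filters)                              # bottom step at 16x16
# The spatial attention module of Sect. 3.3 would refine `x` here.

# Decoding path: four steps of up-sampling + concatenation with the skip.
for skip in reversed(skips):
    filters //= 2
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])
    x = block(x, filters)

outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
model = Model(inputs, outputs)
```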

3.2 MultiBlock module

The details of the MultiBlock module are illustrated in Fig. 3. Densely connected convolution [19] was proposed to mitigate the problem that the sequence of convolutional layers in the original Unet may learn redundant features in successive convolutions. It utilizes the idea of "collective knowledge" by allowing information to flow through the network and reusing feature maps. We adopt the idea of densely connected convolution and propose a variant named MultiBlock in our network. Different from densely connected convolution, the MultiBlock module halves the channels of the original main branch to cut down the model size. In addition, a new path with two \(3\times 3\) convolutional filters is added to expand the receptive field of the module. Assume the number of input channels is 4k; as shown in Fig. 3, the left path outputs k channels and the right path also outputs k channels. The two paths are then concatenated with the input maps, yielding a MultiBlock output with 6k channels.

Fig. 3

MultiBlock module
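A minimal Keras sketch of the MultiBlock module follows. The two-convolution right path and the 4k-to-6k channel arithmetic come from the description above; the single-convolution left path, its kernel size and the ReLU activations are our assumptions.

```python
from tensorflow.keras import layers

def multi_block(x, k):
    """Input with 4k channels -> output with 6k channels (input plus two k-channel paths)."""
    # Left path: one 3x3 convolution producing k channels (assumed).
    left = layers.Conv2D(k, 3, padding="same", activation="relu")(x)
    # Right path: two stacked 3x3 convolutions to enlarge the receptive field.
    right = layers.Conv2D(k, 3, padding="same", activation="relu")(x)
    right = layers.Conv2D(k, 3, padding="same", activation="relu")(right)
    # Reuse the input feature maps, in the spirit of dense connectivity.
    return layers.Concatenate()([x, left, right])
```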

Fig. 4

Spatial attention module. The size of each feature map is shown as \( H \times W \times C \), where H, W and C indicate height, width and number of channels, respectively

3.3 Spatial attention module

The spatial attention module (SAM) was proposed to infer an attention map along the spatial dimension of the features [17]. It is commonly used to perform adaptive feature refinement by multiplying the attention map with the input feature map for image segmentation [17]. As shown in Fig. 4, SAM first concatenates the feature maps obtained by performing average pooling and max pooling along the channel axis, generating an efficient feature descriptor. The concatenated descriptor is then fed into a \(7\times 7\) convolutional layer and a sigmoid activation function to produce an attention map. Finally, the attention map is multiplied with the input feature map to generate the output feature map.
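This pipeline matches the spatial attention of CBAM [17]; a minimal Keras sketch is given below (the Lambda-wrapped pooling ops are one possible implementation, not necessarily the authors' exact code).

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(x):
    """HxWxC input -> HxWxC output refined by an HxWx1 spatial attention map."""
    # Average- and max-pool along the channel axis -> two HxWx1 maps.
    avg_pool = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
    max_pool = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(x)
    descriptor = layers.Concatenate()([avg_pool, max_pool])          # HxWx2
    # 7x7 convolution + sigmoid -> HxWx1 attention map.
    attn = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(descriptor)
    # Adaptive refinement: broadcast-multiply the map over all C channels.
    return layers.Multiply()([x, attn])
```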

3.4 Detail branch

In medical images, the differences between the target objects and the background are commonly not obvious, especially when there are many jagged contours and tiny objects. High-level semantic context information can improve the segmentation of larger structures, but it easily makes mistakes when dealing with these boundary structures. It is well known that the shallow feature maps of a deep convolutional network contain abundant boundary information. In order to enhance the detail information implied in the shallow feature maps, we design a shallow, short-span structure for the detail branch. As shown in the bottom part of Fig. 2, we first pass the input image through a \(1\times 1\) convolutional layer followed by a MultiBlock module to obtain the first-layer feature map. After \(2\times 2\) max pooling and another MultiBlock module, we obtain the second-layer feature map. Finally, the first-layer feature map and the up-sampled second-layer feature map are concatenated with the semantic branch. The number of channels in each layer of the detail branch is listed in Table 2; the branch uses wide channels and shallow layers to enhance the detail information.

Table 2 Number of channels in each layer of the detail branch
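A minimal sketch of the detail branch follows, reusing the `multi_block` helper from the Sect. 3.2 sketch; the channel width is an assumption, as Table 2's exact values are not reproduced here.

```python
from tensorflow.keras import layers

def detail_branch(inputs, filters=64):
    """Wide, shallow branch that returns two full-resolution detail feature maps."""
    x = layers.Conv2D(filters, 1, padding="same")(inputs)  # 1x1 convolution
    d1 = multi_block(x, filters)                           # first-layer feature map
    d2 = layers.MaxPooling2D(2)(d1)                        # 2x2 max pooling
    d2 = multi_block(d2, filters)                          # second-layer feature map
    d2_up = layers.UpSampling2D(2)(d2)                     # back to input resolution
    # Both maps are concatenated with the semantic branch's decoder features.
    return d1, d2_up
```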

4 Experiments and results

4.1 Datasets

We employ four public datasets to verify the effectiveness of the proposed method: the COVID-19 CT Segmentation dataset, the CVC-ClinicDB dataset, the ISIC 2018 dataset and the Lung segmentation dataset. The four datasets are described in detail as follows:

4.1.1 The COVID-19 CT Segmentation dataset

The COVID-19 CT Segmentation dataset is the only open-access CT segmentation dataset for the novel Coronavirus Disease 2019 (COVID-19) [16]. It includes 100 axial CT images collected by the Italian Society of Medical and Interventional Radiology from different COVID-19 patients. The CT images were segmented by a radiologist with different labels identifying lung infections. We randomly split the dataset into a training set of 45 images, a validation set of 5 images and a testing set of the remaining 50 images. Since the dataset suffers from a small sample size, we utilize the same strategy described in [16] to augment the training set.

4.1.2 CVC-ClinicDB dataset

The CVC-ClinicDB dataset [6] is a public, fully annotated colonoscopy image dataset generated from frames of 29 different standard colonoscopy video sequences. Each of its 612 images contains a polyp with a manually annotated ground truth. The original resolution of the images is \(288\times 384\). We randomly divide the dataset into three subsets: a training set of 414 images, a validation set of 85 images and a testing set of the remaining 113 images.

4.1.3 ISIC 2018 dataset

The ISIC 2018 skin cancer segmentation dataset was published by the International Skin Imaging Collaboration (ISIC) and has become a major benchmark for evaluating medical image algorithms [14]. It consists of 2594 images with corresponding annotations localizing lesions in skin images that contain melanoma. The original resolution of the images is \(700\times 900\). We use the same preprocessing strategy as [1] to process the input images and resize them to \(256\times 256\). Following the same strategy described in [3], we split the whole dataset into three subsets: a training set of 1815 images, a validation set of 259 images and a testing set of 520 images.

4.1.4 Lung segmentation dataset

The Lung segmentation dataset was released at the Kaggle Data Science Bowl for the 2017 Lung Nodule Analysis competition, with the aim of developing algorithms that accurately determine when lung lesions are cancerous [22]. The dataset contains 2D and 3D CT images with lung segmentation labels annotated by radiologists. The resolution of each image is \(512\times 512\). Since a CT image contains not only the lung but also other tissues, it is worthwhile to extract the lung mask and ignore everything else. We use the same preprocessing strategy described in [3] to extract the surrounding regions and obtain the lung regions inside them, and we train and test the model on the extracted lung regions. We randomly split the images into three subsets: a training set of 571 images, a validation set of 143 images and a test set of 307 images.

4.2 Implementation details

We implement our method in Python with Keras. We train and evaluate it on a high-performance computer with a 35.4816 Tflops CPU and an 18.8 Tflops GPU. The standard binary cross-entropy loss is used as the training loss. The Adam algorithm with an initial learning rate of 1e-4 is used to optimize the model weights. We set the training batch size to 50, 50, 100 and 50 for the four datasets, respectively. We stop the training process when the validation loss does not decrease for 10 consecutive epochs.
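The stated setup translates into a few lines of Keras; the sketch below assumes `model`, `train_ds` and `val_ds` already exist, and the epoch cap and best-weight restoration are our additions, not reported settings.

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # initial LR 1e-4
    loss="binary_crossentropy",                              # standard BCE loss
)
# Stop when val_loss has not decreased for 10 consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True
)
model.fit(train_ds, validation_data=val_ds, epochs=200, callbacks=[early_stop])
```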

4.3 Evaluation metrics

We utilize several common metrics to measure performance: F1 score (F-measure), accuracy, area under the ROC curve (AUC), sensitivity and specificity. We also use frames per second (FPS), i.e., the number of images that can be processed per second, to measure inference speed.
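For reference, these pixel-wise metrics can be computed from the confusion counts as in the sketch below; the 0.5 threshold on predicted probabilities is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_prob, thr=0.5):
    """Pixel-wise F1, sensitivity, specificity, accuracy and AUC for binary masks."""
    y_pred = (y_prob >= thr).astype(np.uint8)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "F1":          2 * tp / (2 * tp + fp + fn),
        "Sensitivity": tp / (tp + fn),
        "Specificity": tn / (tn + fp),
        "Accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "AUC":         roc_auc_score(y_true.ravel(), y_prob.ravel()),
    }
```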

4.4 Ablation study

To verify the impact of the detail branch, we compare the quantitative results of the proposed method with and without this mechanism. The detailed results are exhibited in Table 3, where Ours_noDetail and Ours indicate the method without and with the detail branch, respectively. From the results, we can see that the detail branch provides a significant improvement on the COVID-19 and CVC-ClinicDB datasets and a slight improvement on the ISIC and Lung datasets.

Table 3 Effects of the detail branch with the semantic branch on four different datasets
Table 4 Effects of the detail branch extended to Unet on four different datasets

To verify that the proposed detail branch can be extended to other networks for contour-aware segmentation, rather than being limited to our semantic branch, we also attach it to Unet [30]. The detailed results are listed in Table 4, where Unet_Detail denotes the standard Unet architecture combined with the detail branch. From the results, we can see that the detail branch again brings a significant improvement on the COVID-19 and CVC-ClinicDB datasets.

From the ablation studies, we can see that the proposed detail branch improves the segmentation results, especially on the COVID-19 and CVC-ClinicDB datasets. Observing the images in these two datasets, we find that the segmented objects are highly similar to the background, and several works regard these two segmentation tasks as camouflaged object detection. A full explanation of this result is difficult to give here; we will verify whether this mechanism performs well on such problems in our future work.

4.5 Results

In recent years, a large number of state-of-the-art algorithms for medical image segmentation have been reported, such as Unet [30], Attention Unet [28], R2U-Net [1], BCDU [3], DeepLabv3+ [8] and Unet++ [44]. This section compares the performance of these algorithms with our proposed method. For fairness, the code for the comparison methods was downloaded from the original websites. It should be noted that we use exactly the same network architecture of our method for all four datasets.

First, we test the proposed method on the COVID-19 CT Segmentation dataset. The quantitative results are shown in detail in Table 5. From the results, we can see that the proposed method achieves a strong performance with 75.16 F1-score, 91.97 accuracy and 83.82 AUC, which are significantly higher than those of the other methods. To illustrate the segmentation, we also display the detailed results for 7 randomly selected images in Fig. 5, which shows that the segmentation results are close to the ground truth, with clearer boundaries.

Table 5 Performance comparison of the proposed network and the state-of-the-art methods on COVID-19 dataset
Fig. 5

Visual comparisons to different methods on COVID-19 dataset. Red = TP, blue = TN, yellow = FN, and green = FP

Second, we test the proposed method on the CVC-ClinicDB dataset. The quantitative results obtained by the different methods and the proposed network are listed in Table 6, which shows that the proposed method achieves the best results on every metric except specificity. It obtains the highest F1-score, sensitivity, accuracy and AUC of 79.98, 77.70, 96.53 and 88.04, respectively, with comparable specificity. We also display the detailed results for 6 randomly selected images in Fig. 6. From Fig. 6, we can see that Unet suffers from under-segmentation and produces contours that are not smooth, Attention Unet tends to over-segment, and BCDU fails to segment targets with unclear contours. Our method shows more precise and finer segmentation outputs than the other methods.

Table 6 Performance comparison of the proposed network and the state-of-the-art methods on CVC-ClinicDB dataset
Fig. 6

Visual comparisons to different methods on CVC-ClinicDB dataset. Red = TP, blue = TN, yellow = FN, and green = FP

Third, we test the proposed method on the ISIC 2018 dataset. Table 7 shows the quantitative results achieved by the different methods and the proposed network. As shown in Table 7, Unet++ achieves the best performance on every metric except specificity. Our method obtains an F1-score, sensitivity, accuracy and AUC of 86.27, 86.28, 92.19 and 90.41, respectively, which are slightly lower than those of Unet++. To display the detailed results clearly, we randomly select 6 images from the testing set and show the segmentation results in Fig. 7. From Fig. 7, we can see that images from the ISIC 2018 dataset have unclear contours and heavy background noise. Unet, Attention Unet and BCDU fail to segment the target contours well, and R2U-Net produces severe over-segmentation. Although the metrics of our method are slightly lower than those of Unet++, its segmentation boundaries are closer to the ground truth.

Table 7 Performance comparison of the proposed network and the state-of-the-art methods on ISIC 2018 dataset
Fig. 7

Visual comparisons to different methods on ISIC 2018 dataset. Red = TP, blue = TN, yellow = FN, and green = FP

Finally, we test our method on the Lung segmentation dataset. The quantitative results on the testing set are reported in Table 8. From these results, we can see that the proposed method achieves the best performance on most evaluation metrics: it obtains the highest F1-score, specificity and accuracy of 98.68, 99.68 and 99.54, respectively, with comparable sensitivity and AUC. To illustrate the segmentation clearly, we also display the detailed results for 6 randomly selected images in Fig. 8. The results illustrate that Unet is not good at recognizing small target objects or controlling the overall shape of target objects, Attention Unet is prone to over-segmentation, R2U-Net pays more attention to the segmentation of large-scale targets, and BCDU lacks the capability to recognize the contour details of target objects. Compared with these methods, our model exhibits remarkable performance in the lung segmentation challenge.

4.6 Limitations and future work

We compare the total number of parameters in the proposed method with those of the above state-of-the-art methods; Table 9 lists the parameter counts. From Table 9, we can see that the proposed method has 107.57M parameters, fewer than the other methods. We also compare the training speed (seconds per epoch) and the inference speed (FPS) of these methods; the results are listed in Tables 10 and 11. We can see that Unet++ converges faster, and Unet is faster than its extensions in most situations. The results also show that although the proposed method has fewer parameters, its training and inference speeds are slower than those of most methods. A full explanation is difficult, but we believe the cause is the MultiBlock modules used in each step of the encoding path: these modules add large convolutional filters to expand the receptive field, and thus need more computation and increase the execution time. How to improve the speed while maintaining the segmentation accuracy of this mechanism remains a significant problem.

Table 8 Performance comparison of the proposed network and the state-of-the-art methods on Lung segmentation dataset
Fig. 8

Visual comparisons to different methods on Lung segmentation dataset. Red = TP, blue = TN, yellow = FN, and green = FP

It is well known that the segmentation performance of a deep neural network relies heavily on the characteristics of the training sets. The experiments show that the presented method generalizes across these four imaging modalities. However, although the boundaries between the segmented objects and the background in these images are not obvious, the segmented objects are solitary, without overlap or adhesion. Segmentation of overlapping and adhering objects is one of the most challenging problems in medical image segmentation. We attempted to apply our method to nuclei segmentation on the MoNuSeg dataset [20], a collection of Hematoxylin–Eosin (H&E) stained tissue images with substantial nuclear overlap and adhesion. The results show that the proposed method, like the compared methods, fails to solve the problem of overlap and adhesion. The reason may be that the lack of boundaries in the overlapping and adhering regions hampers the extraction of contour information, causing the method to fail to recognize these objects. Improving the capability to deal with overlap and adhesion will be one of our future works.

Table 9 Number of parameters of different methods
Table 10 Training speed (second per epoch) of different methods on four datasets
Table 11 Inference speed (FPS) of different methods on four datasets

5 Conclusion

Medical image segmentation is a critical step in developing computer-aided diagnosis systems for clinical practice. In this paper, we proposed an accurate algorithm for medical image segmentation which includes a semantic branch and a detail branch to extract semantic and detail information, respectively. Inspired by densely connected convolution, we designed a MultiBlock module to replace the convolutional blocks of the classical Unet in order to exploit different receptive fields. We also added a spatial attention module between the encoding and decoding paths to suppress redundant features and improve the network’s representation capability. In medical images, the differences between the target objects and the background are commonly not obvious, especially when there are many jagged contours and tiny objects, which high-level context information can hardly recognize well. We therefore designed a detail branch to enhance the detailed contour information implied in the shallow layers and improve the representation capability. In comparison with state-of-the-art methods, our method achieves remarkable performance on four public medical image segmentation challenges.