An attention-fused network for semantic segmentation of very-high-resolution remote sensing imagery

https://doi.org/10.1016/j.isprsjprs.2021.05.004

Abstract

Semantic segmentation is a fundamental task in deep learning and, with the growth of remote sensing big data, is increasingly applied to remote sensing imagery. Deep convolutional neural networks (DCNNs) face two feature-fusion challenges in this setting: fusing the multisource data of very-high-resolution remote sensing images increases the information the network can learn from, which helps DCNNs classify target objects correctly, while fusing high-level abstract features with low-level spatial features improves classification accuracy at the borders between target objects. In this paper, we propose a multipath encoder structure to extract features from multipath inputs, a multipath attention-fused block module to fuse the multipath features, and a refinement attention-fused block module to fuse high-level abstract features and low-level spatial features. Building on these components, we propose a novel convolutional neural network architecture, named the attention-fused network (AFNet). AFNet achieves state-of-the-art performance, with an overall accuracy of 91.7% and a mean F1 score of 90.96% on the ISPRS Vaihingen 2D dataset, and an overall accuracy of 92.1% and a mean F1 score of 93.44% on the ISPRS Potsdam 2D dataset.

Introduction

In recent years, with the rapid development of remote sensing technology, the amount of acquired remote sensing data has grown significantly (Ma et al., 2015). Remotely sensed big data exhibit the "4V" characteristics of volume, variety, velocity, and veracity (Zhang, 2018, Zhang et al., 2019), and rich, important information can be exploited from them. With improvements in sensor technology, the spatial resolution of remote sensing images keeps increasing. High-spatial-resolution images preserve the spatial texture details of target objects (Trias-Sanz et al., 2008); this texture information can be used to identify and classify objects and even to extract accurate contours, exploiting the rich geospatial information contained in the images. The higher the spatial resolution, the larger the data volume and the richer the information it contains (Carleer et al., 2005). High-resolution remote sensing imagery reaches meter- or decimeter-level spatial resolution, while very-high-resolution imagery reaches the centimeter level. In very-high-resolution images, each target object shows rich detail, and different target objects can be distinguished and identified from these detailed features; some target objects can be accurately identified only in very-high-resolution images (Benediktsson et al., 2012). In low- and medium-resolution images, similar target objects are easily confused and difficult to distinguish from each other because a large amount of texture information is lost. Very-high-resolution images can therefore be used more accurately for target object recognition and classification and have an advantage over low- and medium-resolution images.

In recent years, deep learning has advanced by leaps and bounds in the field of computer vision (LeCun et al., 2015). Deep learning is a data-driven technology (Reichstein et al., 2019), and with the development of big data it has shown significant advantages (Chen and Lin, 2014). Deep learning for image analysis is based on deep convolutional neural networks (DCNNs), which build complex spatial texture expression models and exploit the content information in images. Deep learning is widely used in applications such as scene classification (LeCun et al., 1998, Krizhevsky et al., 2012, Szegedy et al., 2015, Simonyan and Zisserman, 2014), object detection (Girshick et al., 2014, Girshick, 2015, Ren et al., 2015, Liu et al., 2016), and semantic segmentation (Long et al., 2015, Badrinarayanan et al., 2017, Ronneberger et al., 2015, Noh et al., 2015). Among these applications, semantic segmentation classifies every pixel in an image, i.e., it performs pixel-level image classification. Since all pixels are classified, the contours of different types of target objects can be accurately extracted, and the positions, shapes, and spatial distributions of the target objects are delineated more accurately.

In the field of remote sensing, typical applications of semantic segmentation include land-use mapping (Castelluccio et al., 2015, Cheng et al., 2015, Hu and Wang, 2013), land-cover mapping (Friedl and Brodley, 1997, Running et al., 1995, Townshend et al., 1991), building extraction (Lefèvre et al., 2007, Vu et al., 2009), and waterbody extraction (Zhaohui et al., 2003, Shen and Li, 2010). Semantic segmentation based on traditional remote sensing methods requires manually designing a feature extractor for the characteristics of each type of target object. Such hand-crafted feature extractors demand substantial professional knowledge (Ball et al., 2017), cannot adapt to complex application scenarios, and have limited generalization capability. Deep learning-based semantic segmentation can effectively overcome these limitations (Zhang et al., 2016): it extracts rich features, is highly robust, and learns the features of different target objects on its own, thereby achieving pixel-level image classification with strong generalization ability.

However, there are also some difficulties in the application of deep learning in the field of remote sensing, and these difficulties are outlined as follows:

  • Images in the field of computer vision are generally three-channel RGB images, whereas remote sensing images are composed of multiband data. There are also other types of remote sensing data, such as the normalized difference vegetation index (NDVI) and the digital surface model (DSM), which are not obtained by optical sensors and have characteristics different from those of ordinary optical images. The most popular DCNNs are designed for three-channel RGB optical images. Although such DCNNs can accept single-channel or multichannel inputs, simply stacking the optical data with the other structural data is not appropriate: it is harder to train one encoder to extract features from multisource data than to train individual encoders on the individual modalities, and fusing the separate features in the decoder simplifies the training objective. Current methods fuse the features extracted from multisource data by summing the feature maps (Audebert et al., 2018, Audebert et al., 2016) or by concatenating individual feature maps (Sherrah, 2016, Marmanis et al., 2018); a minimal sketch contrasting these two strategies follows this list. The effective fusion of such features remains an open research direction.

  • A DCNN is a stack of many convolutional layers and pooling layers: the convolutional layers extract features, and the pooling layers aggregate them. The deeper the network is, the more abstract the extracted information; however, the pooling layers discard a significant amount of spatial information. Conversely, the shallow part of the network cannot adequately extract abstract information, but its spatial information is kept intact. Semantic segmentation must both extract abstract information and retain accurate position information to achieve correct pixel-level classification, and the scenes in remote sensing images are very complicated. The effective fusion of low-level spatial features and high-level abstract features is therefore a problem that needs further optimization.
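The following minimal PyTorch sketch contrasts the two fusion strategies cited in the first difficulty above. The tensor shapes are illustrative only and are not taken from any of the cited networks:

    import torch

    # Feature maps from two modality-specific encoders, e.g. an optical
    # branch and a DSM branch: (batch, channels, height, width).
    f_optical = torch.randn(2, 256, 32, 32)
    f_dsm = torch.randn(2, 256, 32, 32)

    # Element-wise summation (e.g. Audebert et al., 2018): the channel
    # counts must match, and the network cannot reweight one modality
    # against the other per channel.
    fused_sum = f_optical + f_dsm                       # (2, 256, 32, 32)

    # Channel concatenation (e.g. Sherrah, 2016): preserves both feature
    # sets but doubles the channel dimension for the following layers.
    fused_cat = torch.cat([f_optical, f_dsm], dim=1)    # (2, 512, 32, 32)

Neither strategy learns which modality matters where; that limitation motivates the attention-based fusion proposed below.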

In summary, these difficulties involve two types of feature fusion: 1) multipath fusion of features extracted from multisource data and 2) multilevel fusion of high-level abstract features and low-level spatial features. Mainstream DCNNs cannot yet deal with these feature-fusion problems efficiently and effectively. In this paper, we propose a novel attention-fused network (AFNet) architecture, including a multipath attention-fused block (MAFB) module and a refinement attention-fused block (RAFB) module, which together address the problems of "multipath feature fusion" and "multilevel feature fusion".

The MAFB module is designed to address the difficulty of "multipath feature fusion". In semantic segmentation, data from any of the sources may play a key role for a given target object. Therefore, to ensure that the multipath features extracted from different inputs are treated equally, we feed them into the MAFB through a symmetric structure. To suppress the interference of useless feature information on the classification results, we introduce an attention structure: a channel attention (Hu et al., 2018) module calculates the feature weights in the channel dimension to obtain the key channel features, and a spatial attention (Woo et al., 2018) module calculates the feature weights in the spatial dimension to obtain the key spatial features. Fusing these two sets of key features completes the selection and fusion of the multipath features.
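The PyTorch sketch below illustrates the idea behind the MAFB, assuming an SE-style channel attention (Hu et al., 2018) and a CBAM-style spatial attention (Woo et al., 2018); the exact layer configuration of the published block may differ:

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """SE-style channel attention (Hu et al., 2018)."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.fc = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),
            )

        def forward(self, x):
            # Reweight each channel by its learned importance.
            return x * self.fc(x)

    class SpatialAttention(nn.Module):
        """CBAM-style spatial attention (Woo et al., 2018)."""
        def __init__(self, kernel_size=7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

        def forward(self, x):
            # Pool over channels, then learn a per-location weight map.
            avg = x.mean(dim=1, keepdim=True)
            mx, _ = x.max(dim=1, keepdim=True)
            weight = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
            return x * weight

    class MAFB(nn.Module):
        """Symmetric fusion of two encoder paths with channel and spatial
        attention; a sketch of the MAFB idea, not the paper's exact block."""
        def __init__(self, channels):
            super().__init__()
            self.ca = ChannelAttention(channels)
            self.sa = SpatialAttention()

        def forward(self, f_main, f_aux):
            x = f_main + f_aux               # treat both paths equally
            return self.ca(x) + self.sa(x)   # fuse the two key features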

The RAFB module is designed to address the difficulty of "multilevel feature fusion". A channel attention module calculates feature weights in the channel dimension from the high-level abstract features; these weights then select the useful low-level spatial features, improving the abstract expression ability of the low-level path. A spatial attention module calculates feature weights in the spatial dimension from the low-level spatial features; these weights then refine the spatial details of the high-level abstract features. Finally, we fuse the two refined feature maps to obtain the fused multilevel features.
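A corresponding sketch of the RAFB idea, with the cross-level use of attention weights described above. It again assumes SE-style channel attention and CBAM-style spatial attention, and that the high-level map has already been upsampled and projected to the low-level resolution and channel count:

    import torch
    import torch.nn as nn

    class RAFB(nn.Module):
        """Cross-level fusion sketch: channel weights computed from the
        high-level features select low-level channels; spatial weights
        computed from the low-level features refine high-level locations.
        Not the paper's exact block."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.channel_fc = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),
            )
            self.spatial_conv = nn.Conv2d(2, 1, 7, padding=3)

        def forward(self, low, high):
            # Channel weights FROM high-level features, applied TO low.
            low_refined = low * self.channel_fc(high)
            # Spatial weights FROM low-level features, applied TO high.
            avg = low.mean(dim=1, keepdim=True)
            mx, _ = low.max(dim=1, keepdim=True)
            sw = torch.sigmoid(self.spatial_conv(torch.cat([avg, mx], dim=1)))
            high_refined = high * sw
            # Fuse the two refined feature maps.
            return low_refined + high_refined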

In summary, the contributions of this paper are described as follows:

  • Inspired by the channel attention structure and the spatial attention structure, we design a variant spatial attention module that calculates the feature weights in the spatial dimension and extracts the key spatial features.

  • We design a multipath encoder (MPE) structure to simultaneously extract the abstract features and the spatial features from the different data input sources. We rethink the method of feature fusion in the DCNN and design a multipath attention-fused block (MAFB) module to fuse the multipath features from the MPE structure.

  • We design a refinement attention-fused block (RAFB) module to fuse low-level spatial features and high-level abstract features. The RAFB module exploits the respective advantages of features at different levels according to their characteristics.

  • By integrating the MPE structure with the MAFB and RAFB modules, we propose an attention-fused network (AFNet) that simultaneously addresses the "multipath feature fusion" and "multilevel feature fusion" issues. An overview of the AFNet architecture is shown in Fig. 1. Our proposed AFNet achieves state-of-the-art performance on the ISPRS Vaihingen 2D dataset and the ISPRS Potsdam 2D dataset.

The remainder of this paper is organized as follows: Section 2 presents the related work. Section 3 introduces our proposed methodology: the multipath V-shape network (MPVN), the MAFB and RAFB modules, and the AFNet architecture. Section 4 experimentally validates AFNet on the ISPRS Vaihingen 2D dataset and the ISPRS Potsdam 2D dataset. Section 5 discusses the impact of training parameters on AFNet, and Section 6 concludes the paper.

Section snippets

Related work

Several popular CNNs developed in recent years, such as AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan and Zisserman, 2014), GoogLeNet (Szegedy et al., 2015), and ResNet (He et al., 2016), have been used in scene classification. FCN (Long et al., 2015) was the first fully convolutional network designed for semantic segmentation. FCN uses skip connections to refine feature maps and upsamples the output feature maps to the size of the original input. However, the abstraction

Methodology

DCNNs rely on encoders to extract features. The feature map is downsampled multiple times in the encoder, on the one hand to save hardware resources and on the other to aggregate feature information and enlarge the receptive field of the CNN. As the encoder gradually deepens, the feature information becomes increasingly abstract and the feature expression ability increasingly strong. However, as shown in Fig. 2, a significant
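As a worked illustration of why downsampling enlarges the receptive field, the short script below tracks the receptive field through a small, hypothetical stack of convolution and pooling layers (not AFNet's actual configuration). Each (kernel, stride) layer enlarges the receptive field by (kernel - 1) times the product of all preceding strides:

    # Hypothetical encoder: conv, conv, pool repeated twice.
    layers = [(3, 1), (3, 1), (2, 2),
              (3, 1), (3, 1), (2, 2)]

    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # growth scales with accumulated stride
        jump *= stride              # stride product so far
    print(rf)                       # 16: each output pixel sees a 16x16 patch

The two stride-2 layers double the per-layer growth rate, which is why deep, heavily pooled encoders see large contexts while losing fine spatial localization.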

ISPRS Vaihingen 2D dataset

The ISPRS Vaihingen 2D dataset is a benchmark dataset of aerial remote sensing images labeled by the International Society for Photogrammetry and Remote Sensing (ISPRS). The dataset contains six land-cover categories: impervious surfaces (imp_surf), buildings, low vegetation (low_veg), trees, cars, and clutter/background (clutter). The images were acquired over the town of Vaihingen, Germany. As shown in Fig. 10, there are 33 tiles of

Encoder

The MPE module used by AFNet in this paper includes two branches, ResNet-50 and ResNet-18. The main branch uses ResNet-50. Because there are only 16 tiles of images in the training samples of the ISPRS Vaihingen 2D dataset, an overly large encoder is not needed.

ResNet comes in five common variants: ResNet-18/34/50/101/152. First, we choose ResNet-50 as the baseline of the main branch of the encoder. Then, we test replacing ResNet-50 with ResNet-18/34. During the training, we
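A minimal sketch of such a two-branch encoder, built here from the torchvision ResNet implementations. It reflects the description in the text rather than the released code, and for simplicity it feeds the auxiliary branch a 3-channel input; in practice the first convolution would be adapted to the NDVI/DSM channel count, and 1x1 convolutions would typically align the branch channel widths before fusion:

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50, resnet18

    class TwoBranchEncoder(nn.Module):
        """Asymmetric multipath encoder sketch: a ResNet-50 main branch
        for the IRRG image and a lighter ResNet-18 branch for the
        auxiliary NDVI/DSM input."""
        def __init__(self):
            super().__init__()
            main = resnet50(weights=None)
            aux = resnet18(weights=None)
            # Keep everything up to the global pooling / FC head.
            self.main = nn.Sequential(*list(main.children())[:-2])
            self.aux = nn.Sequential(*list(aux.children())[:-2])

        def forward(self, irrg, ndvi_dsm):
            return self.main(irrg), self.aux(ndvi_dsm)

    enc = TwoBranchEncoder()
    f_main, f_aux = enc(torch.randn(1, 3, 256, 256),
                        torch.randn(1, 3, 256, 256))
    print(f_main.shape, f_aux.shape)   # (1, 2048, 8, 8) and (1, 512, 8, 8)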

Conclusions

In this paper, we proposed a new method for semantic segmentation of very-high-resolution remote sensing imagery. We designed the MPE structure to extract the IRRG image features and the NDVI/DSM auxiliary features. The two branches of the MPE are asymmetric, extracting different types of features from different data according to their characteristics, thereby saving hardware resources while maintaining accuracy. Based on the DFN and MPE, we proposed the MPVN. Inspired by the CA

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors thank the International Society for Photogrammetry and Remote Sensing (ISPRS) for making the Vaihingen dataset and the Potsdam dataset available online.

The authors thank the editors and anonymous reviewers for their valuable comments, which greatly improved the quality of the paper.

This research was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA19080302.

References (60)

  • N. Audebert et al. Semantic segmentation of earth observation data using multimodal and multi-scale deep networks.
  • V. Badrinarayanan et al. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
  • J.E. Ball et al. Comprehensive survey of deep learning in remote sensing: theories, tools, and challenges for the community. J. Appl. Remote Sens. (2017).
  • J.A. Benediktsson et al. Very high-resolution remote sensing: challenges and opportunities [point of view]. Proc. IEEE (2012).
  • A. Carleer et al. Assessment of very high spatial resolution satellite image segmentations. Photogrammetric Engineering & Remote Sensing (2005).
  • M. Castelluccio, G. Poggi, C. Sansone, L. Verdoliva. Land use classification in remote sensing images by convolutional...
  • X.-W. Chen et al. Big data deep learning: challenges and perspectives. IEEE Access (2014).
  • L.-C. Chen, G. Papandreou, F. Schroff, H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv...
  • G. Cheng et al. Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. (2015).
  • M. Cordts et al. The Cityscapes dataset for semantic urban scene understanding. In...
  • J. Fu et al. Dual attention network for scene segmentation. In...
  • M. Gerke. Use of the stair vision library within the ISPRS 2D semantic labeling...
  • R. Girshick. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp....
  • R. Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In...
  • K. He et al. Deep residual learning for image recognition. In...
  • S. Hu et al. Automated urban land-use classification with remote sensing. Int. J. Remote Sens. (2013).
  • J. Hu et al. Squeeze-and-excitation networks. In...
  • International Society for Photogrammetry and Remote Sensing (ISPRS) 2D semantic labeling contest, ...
  • International Society for Photogrammetry and Remote Sensing (ISPRS) semantic labeling contest (2D) results, ...
  • S. Ioffe, C. Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift, ...