An attention-fused network for semantic segmentation of very-high-resolution remote sensing imagery

https://doi.org/10.1016/j.isprsjprs.2021.05.004

Abstract

Semantic segmentation is a fundamental task in deep learning and, with the growth of remote sensing big data, is increasingly applied to remote sensing imagery. Deep convolutional neural networks (DCNNs) face two feature-fusion challenges in this setting: fusing the multisource data of very-high-resolution remote sensing images increases the information the network can learn from, which helps DCNNs classify target objects correctly, while fusing high-level abstract features with low-level spatial features improves classification accuracy at the borders between target objects. In this paper, we propose a multipath encoder structure to extract features from multipath inputs, a multipath attention-fused block module to fuse the multipath features, and a refinement attention-fused block module to fuse high-level abstract features and low-level spatial features. Building on these components, we propose a novel convolutional neural network architecture, named the attention-fused network (AFNet). AFNet achieves state-of-the-art performance, with an overall accuracy of 91.7% and a mean F1 score of 90.96% on the ISPRS Vaihingen 2D dataset, and an overall accuracy of 92.1% and a mean F1 score of 93.44% on the ISPRS Potsdam 2D dataset.

Introduction

In recent years, with the rapid development of remote sensing technology, the amount of acquired remote sensing data has grown significantly (Ma et al., 2015). Remotely sensed big data exhibit the "4V" characteristics of volume, variety, velocity, and veracity (Zhang, 2018, Zhang et al., 2019), and rich, important information can be exploited from them. With improvements in sensor technology, the spatial resolution of remote sensing images keeps increasing. High-spatial-resolution images preserve the spatial texture details of target objects (Trias-Sanz et al., 2008); this texture information can be used to identify and classify objects and even to extract accurate contours, exploiting the rich geospatial information contained in the images. The higher the spatial resolution, the larger the data volume and the richer the information it contains (Carleer et al., 2005). High-resolution remote sensing imagery reaches meter- or decimeter-level spatial resolution, while very-high-resolution imagery reaches the centimeter level. In very-high-resolution images, each target object shows rich detail, and different target objects can be distinguished and identified from these detailed features; some target objects can be accurately identified only in very-high-resolution images (Benediktsson et al., 2012). In low- and medium-resolution images, similar target objects are easily confused and difficult to distinguish from each other because a large amount of texture information is lost. Very-high-resolution images can therefore be used more accurately for target object recognition and classification and have an advantage over low- and medium-resolution images.

In recent years, deep learning has advanced by leaps and bounds in the field of computer vision (LeCun et al., 2015). Deep learning is a data-driven technology (Reichstein et al., 2019), and with the development of big data it has shown significant advantages (Chen and Lin, 2014). Deep learning for image analysis is based on deep convolutional neural networks (DCNNs), which build complex spatial texture expression models and exploit the content information in images. Deep learning is widely used in applications such as scene classification (LeCun et al., 1998, Krizhevsky et al., 2012, Szegedy et al., 2015, Simonyan and Zisserman, 2014), object detection (Girshick et al., 2014, Girshick, 2015, Ren et al., 2015, Liu et al., 2016), and semantic segmentation (Long et al., 2015, Badrinarayanan et al., 2017, Ronneberger et al., 2015, Noh et al., 2015). Among these applications, semantic segmentation classifies every pixel in an image, i.e., it performs pixel-level image classification. Since all pixels are classified, the contours of different types of target objects can be accurately extracted, and the positions, shapes, and spatial distributions of the target objects are delineated more accurately.

In the field of remote sensing, typical applications of semantic segmentation include land-use mapping (Castelluccio et al., 2015, Cheng et al., 2015, Hu and Wang, 2013), land-cover mapping (Friedl and Brodley, 1997, Running et al., 1995, Townshend et al., 1991), building extraction (Lefèvre et al., 2007, Vu et al., 2009), and waterbody extraction (Zhaohui et al., 2003, Shen and Li, 2010). Semantic segmentation based on traditional remote sensing methods requires manually designing a feature extractor for the characteristics of each type of target object. Such hand-crafted feature extractors demand substantial professional knowledge (Ball et al., 2017), cannot adapt to complex application scenarios, and have limited generalization capability. Deep learning-based semantic segmentation can effectively overcome these limitations (Zhang et al., 2016): it extracts rich features, is highly robust, and learns the features of different target objects on its own, thereby achieving pixel-level image classification with strong generalization ability.

However, there are also some difficulties in the application of deep learning in the field of remote sensing, and these difficulties are outlined as follows:

  • Images in the field of computer vision are generally three-channel RGB images, whereas remote sensing images are composed of multiband data. There are also other types of remote sensing data, such as the normalized difference vegetation index (NDVI) and the digital surface model (DSM), which are not obtained by optical sensors and have characteristics different from those of ordinary optical images. The most popular DCNNs are designed for three-channel RGB optical images. Although such DCNNs can accept single-channel or multichannel inputs, simply stacking the optical data with the other structural data is not appropriate: it is harder to train one encoder to extract features from multisource data than to train individual encoders on the individual modalities, and fusing the separate features in the decoder simplifies the training objective. Current methods fuse the features extracted from multisource data by summing the feature maps (Audebert et al., 2018, Audebert et al., 2016) or by concatenating individual feature maps (Sherrah, 2016, Marmanis et al., 2018); a minimal sketch contrasting these two strategies follows this list. The effective fusion of such features remains an open research direction.

  • A DCNN is a stack of many convolutional layers and pooling layers: the convolutional layers extract features, and the pooling layers aggregate them. The deeper the network is, the more abstract the extracted information; however, the pooling layers discard a significant amount of spatial information. Conversely, the shallow part of the network cannot adequately extract abstract information, but its spatial information is kept intact. Semantic segmentation must both extract abstract information and retain accurate position information to achieve correct pixel-level classification, and the scenes in remote sensing images are very complicated. The effective fusion of low-level spatial features and high-level abstract features is therefore a problem that needs further optimization.
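The following minimal PyTorch sketch contrasts the two fusion strategies cited in the first difficulty above. The tensor shapes are illustrative only and are not taken from any of the cited networks:

    import torch

    # Feature maps from two modality-specific encoders, e.g. an optical
    # branch and a DSM branch: (batch, channels, height, width).
    f_optical = torch.randn(2, 256, 32, 32)
    f_dsm = torch.randn(2, 256, 32, 32)

    # Element-wise summation (e.g. Audebert et al., 2018): the channel
    # counts must match, and the network cannot reweight one modality
    # against the other per channel.
    fused_sum = f_optical + f_dsm                       # (2, 256, 32, 32)

    # Channel concatenation (e.g. Sherrah, 2016): preserves both feature
    # sets but doubles the channel dimension for the following layers.
    fused_cat = torch.cat([f_optical, f_dsm], dim=1)    # (2, 512, 32, 32)

Neither strategy learns which modality matters where; that limitation motivates the attention-based fusion proposed below.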

In summary, these difficulties involve two types of feature fusion: 1) multipath fusion of features extracted from multisource data and 2) multilevel fusion of high-level abstract features and low-level spatial features. Mainstream DCNNs cannot yet deal with these feature-fusion problems efficiently and effectively. In this paper, we propose a novel attention-fused network (AFNet) architecture, including a multipath attention-fused block (MAFB) module and a refinement attention-fused block (RAFB) module, which together address the problems of "multipath feature fusion" and "multilevel feature fusion".

The MAFB module is designed to address the difficulty of "multipath feature fusion". In semantic segmentation, data from any of the sources may play a key role for a given target object. Therefore, to ensure that the multipath features extracted from different inputs are treated equally, we feed them into the MAFB through a symmetric structure. To suppress the interference of useless feature information on the classification results, we introduce an attention structure: a channel attention (Hu et al., 2018) module calculates the feature weights in the channel dimension to obtain the key channel features, and a spatial attention (Woo et al., 2018) module calculates the feature weights in the spatial dimension to obtain the key spatial features. Fusing these two sets of key features completes the selection and fusion of the multipath features.
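The PyTorch sketch below illustrates the idea behind the MAFB, assuming an SE-style channel attention (Hu et al., 2018) and a CBAM-style spatial attention (Woo et al., 2018); the exact layer configuration of the published block may differ:

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """SE-style channel attention (Hu et al., 2018)."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.fc = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),
            )

        def forward(self, x):
            # Reweight each channel by its learned importance.
            return x * self.fc(x)

    class SpatialAttention(nn.Module):
        """CBAM-style spatial attention (Woo et al., 2018)."""
        def __init__(self, kernel_size=7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

        def forward(self, x):
            # Pool over channels, then learn a per-location weight map.
            avg = x.mean(dim=1, keepdim=True)
            mx, _ = x.max(dim=1, keepdim=True)
            weight = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
            return x * weight

    class MAFB(nn.Module):
        """Symmetric fusion of two encoder paths with channel and spatial
        attention; a sketch of the MAFB idea, not the paper's exact block."""
        def __init__(self, channels):
            super().__init__()
            self.ca = ChannelAttention(channels)
            self.sa = SpatialAttention()

        def forward(self, f_main, f_aux):
            x = f_main + f_aux               # treat both paths equally
            return self.ca(x) + self.sa(x)   # fuse the two key features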

The RAFB module is designed to address the difficulty of "multilevel feature fusion". A channel attention module calculates feature weights in the channel dimension from the high-level abstract features; these weights then select the useful low-level spatial features, improving the abstract expression ability of the low-level path. A spatial attention module calculates feature weights in the spatial dimension from the low-level spatial features; these weights then refine the spatial details of the high-level abstract features. Finally, we fuse the two refined feature maps to obtain the fused multilevel features.
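A corresponding sketch of the RAFB idea, with the cross-level use of attention weights described above. It again assumes SE-style channel attention and CBAM-style spatial attention, and that the high-level map has already been upsampled and projected to the low-level resolution and channel count:

    import torch
    import torch.nn as nn

    class RAFB(nn.Module):
        """Cross-level fusion sketch: channel weights computed from the
        high-level features select low-level channels; spatial weights
        computed from the low-level features refine high-level locations.
        Not the paper's exact block."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.channel_fc = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),
            )
            self.spatial_conv = nn.Conv2d(2, 1, 7, padding=3)

        def forward(self, low, high):
            # Channel weights FROM high-level features, applied TO low.
            low_refined = low * self.channel_fc(high)
            # Spatial weights FROM low-level features, applied TO high.
            avg = low.mean(dim=1, keepdim=True)
            mx, _ = low.max(dim=1, keepdim=True)
            sw = torch.sigmoid(self.spatial_conv(torch.cat([avg, mx], dim=1)))
            high_refined = high * sw
            # Fuse the two refined feature maps.
            return low_refined + high_refined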

In summary, the contributions of this paper are described as follows:

  • Inspired by the channel attention structure and the spatial attention structure, we design a variant spatial attention module that calculates the feature weights in the spatial dimension and extracts the key spatial features.

  • We design a multipath encoder (MPE) structure to simultaneously extract the abstract features and the spatial features from the different data input sources. We rethink the method of feature fusion in the DCNN and design a multipath attention-fused block (MAFB) module to fuse the multipath features from the MPE structure.

  • We design a refinement attention-fused block (RAFB) module to fuse low-level spatial features and high-level abstract features. The RAFB module exploits the respective advantages of features at different levels according to their characteristics.

  • By integrating the MPE structure with the MAFB and RAFB modules, we propose an attention-fused network (AFNet) that simultaneously addresses the "multipath feature fusion" and "multilevel feature fusion" issues. An overview of the AFNet architecture is shown in Fig. 1. Our proposed AFNet achieves state-of-the-art performance on the ISPRS Vaihingen 2D dataset and the ISPRS Potsdam 2D dataset.

The remainder of this paper is organized as follows: Section 2 presents the related work. Section 3 introduces our proposed methodology: the multipath V-shape network (MPVN), the MAFB and RAFB modules, and the AFNet architecture. Section 4 experimentally validates AFNet on the ISPRS Vaihingen 2D dataset and the ISPRS Potsdam 2D dataset. Section 5 discusses the impact of training parameters on AFNet, and Section 6 concludes the paper.

Section snippets

Related work

Several popular CNNs developed in recent years, such as AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan and Zisserman, 2014), GoogLeNet (Szegedy et al., 2015), and ResNet (He et al., 2016), have been used in scene classification. FCN (Long et al., 2015) was the first fully convolutional network designed for semantic segmentation. FCN uses skip connections to refine feature maps and upsamples the output feature maps to the size of the original input. However, the abstraction

Methodology

DCNNs rely on encoders to extract features. The feature map is downsampled multiple times in the encoder, on the one hand to save hardware resources and on the other to aggregate feature information and enlarge the receptive field of the CNN. As the encoder gradually deepens, the feature information becomes increasingly abstract and the feature expression ability increasingly strong. However, as shown in Fig. 2, a significant
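As a worked illustration of why downsampling enlarges the receptive field, the short script below tracks the receptive field through a small, hypothetical stack of convolution and pooling layers (not AFNet's actual configuration). Each (kernel, stride) layer enlarges the receptive field by (kernel - 1) times the product of all preceding strides:

    # Hypothetical encoder: conv, conv, pool repeated twice.
    layers = [(3, 1), (3, 1), (2, 2),
              (3, 1), (3, 1), (2, 2)]

    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # growth scales with accumulated stride
        jump *= stride              # stride product so far
    print(rf)                       # 16: each output pixel sees a 16x16 patch

The two stride-2 layers double the per-layer growth rate, which is why deep, heavily pooled encoders see large contexts while losing fine spatial localization.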

ISPRS Vaihingen 2D dataset

The ISPRS Vaihingen 2D dataset is a benchmark dataset of aerial remote sensing images labeled by the International Society for Photogrammetry and Remote Sensing (ISPRS). The dataset contains six land-cover categories: impervious surfaces (imp_surf), buildings, low vegetation (low_veg), trees, cars, and clutter/background (clutter). The images were acquired over the town of Vaihingen, Germany. As shown in Fig. 10, there are 33 tiles of

Encoder

The MPE module used by AFNet in this paper includes two branches, ResNet-50 and ResNet-18. The main branch uses ResNet-50. Because there are only 16 tiles of images in the training samples of the ISPRS Vaihingen 2D dataset, an overly large encoder is not needed.

ResNet comes in five common variants: ResNet-18/34/50/101/152. First, we choose ResNet-50 as the baseline of the main branch of the encoder. Then, we test replacing ResNet-50 with ResNet-18/34. During the training, we
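A minimal sketch of such a two-branch encoder, built here from the torchvision ResNet implementations. It reflects the description in the text rather than the released code, and for simplicity it feeds the auxiliary branch a 3-channel input; in practice the first convolution would be adapted to the NDVI/DSM channel count, and 1x1 convolutions would typically align the branch channel widths before fusion:

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50, resnet18

    class TwoBranchEncoder(nn.Module):
        """Asymmetric multipath encoder sketch: a ResNet-50 main branch
        for the IRRG image and a lighter ResNet-18 branch for the
        auxiliary NDVI/DSM input."""
        def __init__(self):
            super().__init__()
            main = resnet50(weights=None)
            aux = resnet18(weights=None)
            # Keep everything up to the global pooling / FC head.
            self.main = nn.Sequential(*list(main.children())[:-2])
            self.aux = nn.Sequential(*list(aux.children())[:-2])

        def forward(self, irrg, ndvi_dsm):
            return self.main(irrg), self.aux(ndvi_dsm)

    enc = TwoBranchEncoder()
    f_main, f_aux = enc(torch.randn(1, 3, 256, 256),
                        torch.randn(1, 3, 256, 256))
    print(f_main.shape, f_aux.shape)   # (1, 2048, 8, 8) and (1, 512, 8, 8)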

Conclusions

In this paper, we proposed a new method for semantic segmentation of very-high-resolution remote sensing imagery. We designed the MPE structure to extract the IRRG image features and the NDVI/DSM auxiliary features. The two branches of the MPE are asymmetric, extracting different types of features from different data according to their characteristics, thereby saving hardware resources while maintaining accuracy. Based on the DFN and MPE, we proposed the MPVN. Inspired by the CA

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors thank the International Society for Photogrammetry and Remote Sensing (ISPRS) for making the Vaihingen dataset and the Potsdam dataset available online.

The authors thank the editors and anonymous reviewers for their valuable comments, which greatly improved the quality of the paper.

This research was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA19080302.

References (60)

  • N. Audebert et al. Semantic segmentation of earth observation data using multimodal and multi-scale deep networks.
  • V. Badrinarayanan et al. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
  • J.E. Ball et al. Comprehensive survey of deep learning in remote sensing: theories, tools, and challenges for the community. J. Appl. Remote Sens. (2017).
  • J.A. Benediktsson et al. Very high-resolution remote sensing: challenges and opportunities [point of view]. Proc. IEEE (2012).
  • A. Carleer et al. Assessment of very high spatial resolution satellite image segmentations. Photogrammetric Engineering & Remote Sensing (2005).
  • M. Castelluccio, G. Poggi, C. Sansone, L. Verdoliva. Land use classification in remote sensing images by convolutional...
  • X.-W. Chen et al. Big data deep learning: challenges and perspectives. IEEE Access (2014).
  • L.-C. Chen, G. Papandreou, F. Schroff, H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv...
  • G. Cheng et al. Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. (2015).
  • M. Cordts et al. The Cityscapes dataset for semantic urban scene understanding. In...
  • J. Fu et al. Dual attention network for scene segmentation. In...
  • M. Gerke. Use of the stair vision library within the ISPRS 2D semantic labeling...
  • R. Girshick. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp....
  • R. Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In...
  • K. He et al. Deep residual learning for image recognition. In...
  • S. Hu et al. Automated urban land-use classification with remote sensing. Int. J. Remote Sens. (2013).
  • J. Hu et al. Squeeze-and-excitation networks. In...
  • International Society for Photogrammetry and Remote Sensing (ISPRS) 2D semantic labeling contest, ...
  • International Society for Photogrammetry and Remote Sensing (ISPRS) semantic labeling contest (2D) results, ...
  • S. Ioffe, C. Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift, ...