ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data
Introduction
Semantic labelling of very high resolution (VHR) remotely-sensed images, i.e., the task of assigning a category to every pixel in an image, is of great interest for a wide range of urban applications including land-use planning, infrastructure management, as well as urban sprawl detection (Matikainen and Karila, 2011, Zhang and Seto, 2011, Lu et al., 2017, Goldblatt et al., 2018). Labelling tasks generally focus either on extracting one specific category, e.g., building, road, or certain vegetation types (Li et al., 2015, Cheng et al., 2017, Wen et al., 2017), or on multiple classes simultaneously (Paisitkriangkrai et al., 2016, Längkvist et al., 2016, Liu et al., 2018, Marmanis et al., 2018).
Extracting spatially consistent information in urban environments from remotely-sensed imagery remains particularly challenging for two main reasons. First, urban classes often display a high within-class variability and a low between-class variability. On the one hand, man-made objects of the same semantic class are often built in different materials and with different structures, leading to an incredible diversity of colors, sizes, shapes, and textures. On the other hand, semantically-different man-made objects can present similar characteristics, e.g., cement rooftops, cement sidewalks, and cement roads. Therefore, objects with similar spectral signatures can belong to completely different classes. Second, the intricate three-dimensional structure of urban environments is favourable to interactions between these objects, e.g., through occlusions and cast shadows.
Circumventing these issues requires going beyond the sole use of spectral information and including geometric elements of the urban class appearance such as pattern, shape, size, context, and orientation. Nonetheless, pixel-based classifications still fail to satisfy the accuracy requirements because they are affected by the salt-and-pepper effect and cannot fully exploit the rich information content of VHR data (Myint et al., 2011, Li and Shao, 2014). GEographic Object-Based Imagery Analysis (GEOBIA) is an alternative image processing approach that seeks to group pixels into meaningful objects based on specified parameters (Blaschke et al., 2014). Popular image segmentation algorithms in remote sensing include watershed segmentation (Vincent and Soille, 1991), multi-resolution segmentation (Baatz and Schäpe, 2000) and mean-shift segmentation (Comaniciu and Meer, 2002). In addition, GEOBIA makes it possible to compute additional attributes related to the texture, context, and shape of the objects, which can then be added to the classification feature set. However, there is no universally-accepted method to identify the segmentation parameters that provide optimal pixel grouping, which implies that GEOBIA remains highly interactive and involves subjective trial-and-error methods and arbitrary decisions. Furthermore, image segmentation might fail to simultaneously address the wide range of object sizes that one typically encounters in urban landscapes, ranging from finely structured objects such as cars and trees to larger objects such as buildings. Another drawback is that GEOBIA relies on pre-selected features for which the maximum attainable accuracy is a priori unknown. While several methods have been devised to extract and select features, these methods are not themselves learned from the data, and are thus potentially sub-optimal.
In recent years, deep learning methods and Convolutional Neural Networks (CNNs) in particular (LeCun et al., 1989) have surpassed traditional methods in various computer vision tasks, such as object detection, semantic segmentation, and instance segmentation (see Rawat and Wang, 2017, for a comprehensive review). A key advantage of CNN-based algorithms is that they provide end-to-end solutions that require minimal feature engineering and offer greater generalization capabilities. They also perform object-based classification, i.e., they take into account features that characterize entire image objects, thereby reducing the salt-and-pepper effect that affects conventional classifiers.
Our approach to annotating image pixels with class labels is object-based, that is, the algorithm extracts characteristic features from whole (or parts of) objects that exist in images, such as cars, trees, or corners of buildings, and assigns a vector of class probabilities to each pixel. In contrast, with standard classifiers such as random forests, the probability of each class per pixel is based on features inherent in the spectral signature only. Features based on spectral signatures contain less information than features based on objects. For example, looking at a car we understand not only its spectral features (color) but also how these vary and the extent they occupy in an image. In addition, we understand that a car is more likely to be surrounded by pixels belonging to a road, and less likely to be surrounded by pixels belonging to buildings. In the field of computer vision, there is a vast literature on modules used in convolutional neural networks that exploit this idea of “per object classification”. These modules, such as atrous convolutions (Chen et al., 2016) and pyramid pooling (He et al., 2014, Zhao et al., 2017a), boost the algorithmic performance on semantic segmentation tasks. In addition, after the residual networks era (He et al., 2015) it is now possible to train deeper neural networks while avoiding, to a great extent, the problem of vanishing (or exploding) gradients.
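To make the idea of aggregating object-scale context concrete, the following is a minimal sketch of a block that applies several atrous (dilated) convolutions in parallel and sums their outputs, so that each output pixel combines evidence from neighbourhoods of different sizes. It is written with MXNet Gluon, the framework mentioned in the acknowledgements, but the class name, channel count, and dilation rates are illustrative assumptions rather than the exact configuration used in ResUNet-a.

```python
import mxnet as mx
from mxnet.gluon import nn


class ParallelAtrousBlock(nn.HybridBlock):
    """Sums several dilated 3x3 convolutions applied to the same input (sketch only)."""

    def __init__(self, channels, dilations=(1, 3, 15, 31), **kwargs):
        super().__init__(**kwargs)
        self.branches = nn.HybridSequential()
        for d in dilations:
            branch = nn.HybridSequential()
            # Pre-activation style: BatchNorm -> ReLU -> dilated convolution.
            branch.add(nn.BatchNorm(),
                       nn.Activation('relu'),
                       nn.Conv2D(channels, kernel_size=3, padding=d, dilation=d))
            self.branches.add(branch)

    def hybrid_forward(self, F, x):
        # padding == dilation keeps the spatial size unchanged, so the
        # branch outputs can simply be summed with the identity input.
        out = x
        for i in range(len(self.branches)):
            out = out + self.branches[i](x)
        return out


# Quick shape check on a random 32-channel, 256x256 feature map.
x = mx.nd.random.uniform(shape=(1, 32, 256, 256))
block = ParallelAtrousBlock(channels=32)
block.initialize()
print(block(x).shape)  # -> (1, 32, 256, 256)
```

The small dilation rates see local texture while the large ones see object-scale surroundings, which is one way such modules provide the "per object" context discussed above.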
Here, we introduce a novel Fully Convolutional Network (FCN) for semantic segmentation, termed ResUNet-a. This network combines ideas distilled from computer vision applications of deep learning, and demonstrates competitive performance. In addition, we describe a modeling framework consisting of a new loss function that behaves well for semantic segmentation problems with class imbalance as well as for regression problems. In summary, the main contributions of this paper are the following:
- 1. A novel architecture for understanding and labeling very high resolution images for the task of semantic segmentation. The architecture uses a UNet (Ronneberger et al., 2015) encoder/decoder backbone in combination with residual connections (He et al., 2016), atrous convolutions (Chen et al., 2016, Chen et al., 2017), pyramid scene parsing pooling (Zhao et al., 2017a), and multi-task inference (Ruder, 2017); we present two variants of the basic architecture, a single-task and a multi-task one.
- 2. We analyze the performance of various flavours of the Dice coefficient for semantic segmentation. Based on our findings, we introduce a variant of the Dice loss function that speeds up the convergence of semantic segmentation tasks and improves performance. Our results indicate that the new loss function behaves well even when there is a large class imbalance. This loss can also be used for continuous variables when the target domain of values is in the range [0,1].
In addition, we present a data augmentation methodology in which the input is presented to the algorithm at multiple scales during training; this improves performance and helps avoid overfitting. The performance of ResUNet-a was tested using the Potsdam data set made available through the ISPRS competition (ISPRS). Validation results show that ResUNet-a achieves state-of-the-art results.
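As an illustration of what such multi-scale training can look like in practice, the snippet below crops a randomly sized window from a training tile and resamples it back to a fixed input size, so that objects appear at varying apparent scales. This is a minimal sketch under assumed parameters (patch size, scale range, use of skimage for resampling), not the exact augmentation pipeline described in Section 3.3.

```python
import numpy as np
from skimage.transform import resize


def random_scale_crop(image, mask, out_size=256, scale_range=(0.75, 1.5), rng=None):
    """image: (H, W, C) float array; mask: (H, W) integer label array."""
    rng = rng or np.random.default_rng()
    h, w = mask.shape
    # Pick a window whose side length varies around the nominal training size.
    crop = int(out_size * rng.uniform(*scale_range))
    crop = min(crop, h, w)                               # never exceed the tile
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    img_crop = image[top:top + crop, left:left + crop]
    msk_crop = mask[top:top + crop, left:left + crop]
    # Bilinear resampling for the image, nearest neighbour for the labels.
    img_out = resize(img_crop, (out_size, out_size), order=1, preserve_range=True)
    msk_out = resize(msk_crop, (out_size, out_size), order=0,
                     preserve_range=True, anti_aliasing=False)
    return img_out.astype(image.dtype), msk_out.astype(mask.dtype)
```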
This article is organized as follows. In Section 2 we provide a short review of related work on the topic of semantic segmentation, focused on the field of remote sensing. In Section 3, we detail the model architecture and the modeling framework. Section 4 describes the data set we used for training our algorithm. In Section 5 we provide an experimental analysis that justifies the design choices for our modeling framework. Finally, Section 6 presents the performance evaluation of our algorithm and a comparison with other published results. Readers are referred to Appendix A for a description of our software implementation and hardware configurations, and to Appendix C for the full error maps on unseen test data.
Related work
The task of semantic segmentation has attracted significant interest in recent years, not only in the computer vision community but also in other disciplines (e.g. biomedical imaging, remote sensing) where automated annotation of images is an important process. In particular, specialized techniques have been developed across different disciplines, since each faces task-specific peculiarities that the computer vision community does not have to address (and vice versa).
Starting from
The ResUNet-a framework
In this section, we introduce the architecture of ResUNet-a in full detail (Section 3.1), a novel loss function designed to achieve faster convergence and higher performance (Section 3.2), our data augmentation methodology (Section 3.3), as well as the methodology we followed for performing inference on large images (Section 3.4). The training strategy and software implementation characteristics can be found in Appendix A.
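For intuition on the inference step over large tiles, the following is a rough sketch of a standard tiled-inference scheme: overlapping windows are classified independently, the soft predictions are averaged where windows overlap, and the per-pixel argmax gives the final label map. The window size, stride, and the predict_window callable are assumptions for illustration; the exact procedure used in this work is the one described in Section 3.4.

```python
import numpy as np


def predict_large_image(image, predict_window, window=256, stride=128, n_classes=6):
    """image: (H, W, C); predict_window: callable mapping a (window, window, C)
    patch to (window, window, n_classes) class probabilities.
    Assumes H and W are at least `window`."""
    h, w, _ = image.shape
    prob_sum = np.zeros((h, w, n_classes), dtype=np.float64)
    counts = np.zeros((h, w, 1), dtype=np.float64)
    tops = list(range(0, h - window + 1, stride))
    lefts = list(range(0, w - window + 1, stride))
    # Make sure the bottom and right borders are always covered.
    if tops[-1] != h - window:
        tops.append(h - window)
    if lefts[-1] != w - window:
        lefts.append(w - window)
    for top in tops:
        for left in lefts:
            patch = image[top:top + window, left:left + window]
            prob = predict_window(patch)                  # soft class scores
            prob_sum[top:top + window, left:left + window] += prob
            counts[top:top + window, left:left + window] += 1.0
    # Average overlapping predictions, then take the per-pixel argmax.
    return np.argmax(prob_sum / counts, axis=-1)
```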
Data and preprocessing
We sourced data from the ISPRS 2D Semantic Labelling Challenge, and in particular the Potsdam data set (ISPRS). The data consist of a set of true orthophotos (TOP) extracted from a larger mosaic, and a Digital Surface Model (DSM). The TOP comprises four spectral bands: red (R), green (G), and blue (B) in the visible (VIS) range, plus the near infrared (NIR); the ground sampling distance is 5 cm. The normalized DSM layer provides information on the height of each pixel as the ground elevation
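As a concrete example of how such a tile can be turned into a network input, the snippet below stacks the four TOP bands with the normalized DSM into a single 5-channel array. The file paths, band scaling, and the use of rasterio are illustrative assumptions; the preprocessing actually applied in this work is the one described in Section 4.

```python
import numpy as np
import rasterio


def load_tile(top_path, ndsm_path):
    """Stack the 4-band TOP with the normalized DSM into a (5, H, W) array."""
    with rasterio.open(top_path) as src:
        top = src.read().astype(np.float32) / 255.0      # (4, H, W): R, G, B, NIR
    with rasterio.open(ndsm_path) as src:
        ndsm = src.read(1).astype(np.float32)             # (H, W) heights
    # Rescale the nDSM so that all five channels share a comparable [0, 1] range.
    ndsm = (ndsm - ndsm.min()) / (ndsm.max() - ndsm.min() + 1e-6)
    return np.concatenate([top, ndsm[None, ...]], axis=0)
```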
Architecture and Tanimoto loss experimental analysis
In this section, we perform an experimental analysis of the ResUNet-a architecture as well as the performance of the Tanimoto with complement loss function.
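To make the loss under analysis concrete, below is a small NumPy sketch of a Tanimoto-with-complement loss of the kind studied here: the Tanimoto overlap between predicted probabilities and one-hot labels is averaged with the overlap of their complements, and the loss is one minus the result. This reflects our reading of the definition in Section 3.2; the smoothing constant and the averaging over classes are illustrative choices.

```python
import numpy as np


def tanimoto(p, l, eps=1e-6):
    """Tanimoto overlap per class. p, l: arrays of shape (N, n_classes), values in [0, 1]."""
    num = np.sum(p * l, axis=0)
    den = np.sum(p * p + l * l - p * l, axis=0)
    return (num + eps) / (den + eps)


def tanimoto_with_complement_loss(p, l):
    """Average the Tanimoto overlap of (p, l) and of their complements (1-p, 1-l),
    then turn it into a loss by subtracting from one."""
    t = 0.5 * (tanimoto(p, l) + tanimoto(1.0 - p, 1.0 - l))
    return 1.0 - np.mean(t)  # mean over classes
```

The complement term gives weight to correctly predicted background as well as foreground, which is one reason this family of losses is attractive when classes are strongly imbalanced.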
Results and discussion
In this section, we present and discuss the performance of ResUNet-a. We also compare the efficiency of our model with results from architectures of other authors. We present results for both the FoV × 4 and FoV × 1 versions of the ISPRS Potsdam dataset. It should be noted that the ground truth masks of the test set were made publicly available in June 2018. Since then, the ISPRS 2D semantic labelling online test results have not been updated. The ground truth labels used to calculate the
Conclusions
In this work, we present a new deep learning modeling framework for semantic segmentation of high resolution aerial images. The framework consists of a novel multitasking deep learning architecture for semantic segmentation and a new variant of the Dice loss that we term Tanimoto.
Our deep learning architecture, ResUNet-a, is based on the encoder/decoder paradigm, where standard convolutions are replaced with ResNet units that contain multiple atrous convolutions in parallel. Pyramid scene
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors acknowledge the support of the Scientific Computing team of CSIRO, and in particular Peter H. Campbell and Ondrej Hlinka. Their contribution was substantial in overcoming many technical difficulties of distributed GPU computing. The authors are also grateful to John Taylor for his help in understanding and implementing distributed optimization using Horovod (Sergeev and Balso, 2018). The authors acknowledge the support of the mxnet community, and in particular Thomas Delteil, Sina
References (86)
- Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogramm. Remote Sens. (2018)
- Geographic object-based image analysis – towards a new paradigm. ISPRS J. Photogramm. Remote Sens. (2014)
- Distance transformations in digital images. Comput. Vision Graph. Image Process. (1986)
- Using Landsat and nighttime lights for supervised pixel-based image classification of urban land cover. Remote Sens. Environ. (2018)
- Semantic labeling in very high resolution images via a self-cascaded convolutional neural network. ISPRS J. Photogramm. Remote Sens. (2018)
- Deep learning in remote sensing applications: A meta-analysis and review. ISPRS J. Photogramm. Remote Sens. (2019)
- Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS J. Photogramm. Remote Sens. (2018)
- Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) – Protein Structure (1975)
- Per-pixel vs. object-based classification of urban land cover extraction using high spatial resolution imagery. Remote Sens. Environ. (2011)
- Mapping urbanization dynamics at regional and global scales using multi-temporal DMSP/OLS nighttime light data. Remote Sens. Environ. (2011)
- Contextually guided very-high-resolution imagery classification with semantic segments. ISPRS J. Photogramm. Remote Sens.
- Segment-before-detect: Vehicle detection and classification through semantic segmentation of aerial images. Remote Sens.
- Automatic road detection and centerline extraction via cascaded end-to-end convolutional neural network. IEEE Trans. Geosci. Remote Sens.
- Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell.
- Generalized overlap measures for evaluation and validation in medical image analysis. IEEE Trans. Med. Imaging
- The PASCAL Visual Object Classes (VOC) challenge. Int. J. Comput. Vision
- A survey on deep learning-driven remote sensing image scene understanding: Scene classification, scene retrieval and scene-guided object detection. Appl. Sci.
- Cropland mapping over Sahelian and Sudanian agrosystems: A knowledge-based approach using PROBA-V time series at 100-m. Remote Sens.
- Classification and segmentation of satellite orthoimagery using convolutional neural networks. Remote Sens.