ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data

https://doi.org/10.1016/j.isprsjprs.2020.01.013

Abstract

Scene understanding of high resolution aerial images is of great importance for the task of automated monitoring in various remote sensing applications. Due to the large within-class and small between-class variance in pixel values of objects of interest, this remains a challenging task. In recent years, deep convolutional neural networks have started being used in remote sensing applications and demonstrate state-of-the-art performance for pixel-level classification of objects. Here we propose a reliable framework for the task of semantic segmentation of monotemporal very high resolution aerial images. Our framework consists of a novel deep learning architecture, ResUNet-a, and a novel loss function based on the Dice loss. ResUNet-a uses a UNet encoder/decoder backbone, in combination with residual connections, atrous convolutions, pyramid scene parsing pooling and multi-tasking inference. ResUNet-a infers sequentially the boundary of the objects, the distance transform of the segmentation mask, the segmentation mask itself and a colored reconstruction of the input. Each of the tasks is conditioned on the inference of the previous ones, thus establishing a conditioned relationship between the various tasks, as described through the architecture's computation graph. We analyse the performance of several flavours of the Generalized Dice loss for semantic segmentation, and we introduce a novel variant loss function for semantic segmentation of objects that has excellent convergence properties and behaves well even under the presence of highly imbalanced classes. The performance of our modeling framework is evaluated on the ISPRS 2D Potsdam dataset. Results show state-of-the-art performance, with an average F1 score of 92.9% over all classes for our best model.

Introduction

Semantic labelling of very high resolution (VHR) remotely-sensed images, i.e., the task of assigning a category to every pixel in an image, is of great interest for a wide range of urban applications including land-use planning, infrastructure management, as well as urban sprawl detection (Matikainen and Karila, 2011, Zhang and Seto, 2011, Lu et al., 2017, Goldblatt et al., 2018). Labelling tasks generally focus on extracting one specific category, e.g., building, road, or certain vegetation types (Li et al., 2015, Cheng et al., 2017, Wen et al., 2017), or multiple classes all together (Paisitkriangkrai et al., 2016, Längkvist et al., 2016, Liu et al., 2018, Marmanis et al., 2018).

Extracting spatially consistent information in urban environments from remotely-sensed imagery remains particularly challenging for two main reasons. First, urban classes often display a high within-class variability and a low between-class variability. On the one hand, man-made objects of the same semantic class are often built in different materials and with different structures, leading to an incredible diversity of colors, sizes, shapes, and textures. On the other hand, semantically-different man-made objects can present similar characteristics, e.g., cement rooftops, cement sidewalks, and cement roads. Therefore, objects with similar spectral signatures can belong to completely different classes. Second, the intricate three-dimensional structure of urban environments is favourable to interactions between these objects, e.g., through occlusions and cast shadows.

Circumventing these issues requires going beyond the sole use of spectral information and including geometric elements of the urban class appearance such as pattern, shape, size, context, and orientation. Nonetheless, pixel-based classifications still fail to satisfy the accuracy requirements because they are affected by the salt-and-pepper effect and cannot fully exploit the rich information content of VHR data (Myint et al., 2011, Li and Shao, 2014). GEographic Object-Based Image Analysis (GEOBIA) is an alternative image processing approach that seeks to group pixels into meaningful objects based on specified parameters (Blaschke et al., 2014). Popular image segmentation algorithms in remote sensing include watershed segmentation (Vincent and Soille, 1991), multi-resolution segmentation (Baatz and Schäpe, 2000) and mean-shift segmentation (Comaniciu and Meer, 2002). In addition, GEOBIA makes it possible to compute additional attributes related to the texture, context, and shape of the objects, which can then be added to the classification feature set. However, there is no universally-accepted method to identify the segmentation parameters that provide optimal pixel grouping, which implies that GEOBIA is still highly interactive and includes subjective trial-and-error methods and arbitrary decisions. Furthermore, image segmentation might fail to simultaneously address the wide range of object sizes that one typically encounters in urban landscapes, ranging from finely structured objects such as cars and trees to larger objects such as buildings. Another drawback is that GEOBIA relies on pre-selected features for which the maximum attainable accuracy is a priori unknown. While several methods have been devised to extract and select features, these methods are not themselves learned from the data, and are thus potentially sub-optimal.
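As an illustration of the pixel-grouping step that GEOBIA builds on, the minimal scikit-image sketch below applies watershed segmentation (Vincent and Soille, 1991) to an image gradient. The marker threshold is an arbitrary illustrative value; the sensitivity of the result to such choices is precisely the trial-and-error aspect discussed above.

```python
# Minimal illustration of unsupervised pixel grouping (GEOBIA's first
# step): watershed segmentation on an image gradient with scikit-image.
# The marker threshold (0.05) is an arbitrary, untuned choice.
import numpy as np
from scipy import ndimage as ndi
from skimage import color, data, filters
from skimage.segmentation import watershed

rgb = data.astronaut()                    # stand-in for a VHR image tile
gradient = filters.sobel(color.rgb2gray(rgb))

markers, _ = ndi.label(gradient < 0.05)   # seeds in homogeneous regions
segments = watershed(gradient, markers)   # one integer label per object

print(int(segments.max()), "candidate objects")
```

Attributes such as per-segment texture or shape statistics would then be computed on the resulting label image and added to the classification feature set.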

In recent years, deep learning methods, and Convolutional Neural Networks (CNNs) in particular (LeCun et al., 1989), have surpassed traditional methods in various computer vision tasks, such as object detection and semantic and instance segmentation (see Rawat and Wang, 2017, for a comprehensive review). Among the key advantages of CNN-based algorithms is that they provide end-to-end solutions that require minimal feature engineering and offer greater generalization capabilities. They also perform object-based classification, i.e., they take into account features that characterize entire image objects, thereby reducing the salt-and-pepper effect that affects conventional classifiers.

Our approach to annotating image pixels with class labels is object-based; that is, the algorithm extracts characteristic features from whole (or parts of) objects that exist in images, such as cars, trees, or corners of buildings, and assigns a vector of class probabilities to each pixel. In contrast, using standard classifiers such as random forests, the probability of each class per pixel is based on features inherent in the spectral signature only. Features based on spectral signatures contain less information than features based on objects. For example, looking at a car we understand not only its spectral features (color) but also how these vary, as well as the extent the car occupies in an image. In addition, we understand that a car is more likely to be surrounded by pixels belonging to a road, and less likely to be surrounded by pixels belonging to buildings. In the field of computer vision, there is a vast literature on modules used in convolutional neural networks that exploit this idea of “per object classification”. These modules, such as atrous convolutions (Chen et al., 2016) and pyramid pooling (He et al., 2014, Zhao et al., 2017a), boost algorithmic performance on semantic segmentation tasks. In addition, after the residual networks era (He et al., 2015) it is now possible to train deeper neural networks while avoiding, to a great extent, the problem of vanishing (or exploding) gradients.
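To illustrate how these modules combine, the sketch below implements a residual unit whose body consists of several parallel atrous (dilated) convolution branches, in the spirit of the building blocks described in Section 3.1. It is written with MXNet Gluon, the framework of our implementation (Appendix A); the dilation rates, channel width, and pre-activation ordering here are illustrative choices rather than the exact ResUNet-a configuration.

```python
# Sketch: a residual unit with parallel atrous (dilated) convolution
# branches. Dilation rates and width are illustrative, not the exact
# ResUNet-a configuration.
from mxnet import nd
from mxnet.gluon import nn

def atrous_branch(channels, dilation):
    """Pre-activation 3x3 conv; with padding = dilation the spatial size
    is preserved while the receptive field grows with the dilation rate."""
    branch = nn.HybridSequential()
    branch.add(nn.BatchNorm(),
               nn.Activation('relu'),
               nn.Conv2D(channels, kernel_size=3, padding=dilation,
                         dilation=dilation, use_bias=False))
    return branch

class ParallelAtrousResBlock(nn.HybridBlock):
    """Identity skip plus a sum of atrous branches (residual learning)."""
    def __init__(self, channels, dilations=(1, 3, 15), **kwargs):
        super().__init__(**kwargs)
        self.branches = nn.HybridSequential()
        for d in dilations:
            self.branches.add(atrous_branch(channels, d))

    def hybrid_forward(self, F, x):
        out = x
        for branch in self.branches:
            out = out + branch(x)   # all branches see the same input
        return out

block = ParallelAtrousResBlock(channels=32)
block.initialize()
print(block(nd.random.uniform(shape=(1, 32, 64, 64))).shape)  # (1, 32, 64, 64)
```

Each branch covers a different effective receptive field at the same parameter cost, which is what allows objects of widely varying sizes to be captured within a single unit.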

Here, we introduce a novel Fully Convolutional Network (FCN) for semantic segmentation, termed ResUNet-a. This network combines ideas distilled from computer vision applications of deep learning, and demonstrates competitive performance. In addition, we describe a modeling framework consisting of a new loss function that behaves well for semantic segmentation problems with class imbalance as well as for regression problems. In summary, the main contributions of this paper are the following:

1. A novel architecture for understanding and labeling very high resolution images for the task of semantic segmentation. The architecture uses a UNet (Ronneberger et al., 2015) encoder/decoder backbone, in combination with residual connections (He et al., 2016), atrous convolutions (Chen et al., 2016, Chen et al., 2017), pyramid scene parsing pooling (Zhao et al., 2017a) and multi-task inference (Ruder, 2017). We present two variants of the basic architecture: a single-task and a multi-task one.

2. We analyze the performance of various flavours of the Dice coefficient for semantic segmentation. Based on our findings, we introduce a variant of the Dice loss function that speeds up the convergence of semantic segmentation tasks and improves performance (a sketch of the idea follows this list). Our results indicate that the new loss function behaves well even when there is a large class imbalance. It can also be used for continuous variables whose target domain of values is in the range [0, 1].

In addition, we present a data augmentation methodology in which the algorithm views the input at multiple scales during training; this improves performance and avoids overfitting. The performance of ResUNet-a was tested using the Potsdam dataset made available through the ISPRS competition (ISPRS). Validation results show that ResUNet-a achieves state-of-the-art results.
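To make the idea behind the new loss concrete, the following minimal NumPy sketch implements one reading of the Tanimoto-with-complement loss introduced in Section 3.2: the Tanimoto coefficient is averaged over the masks and their complements, and the loss is its complement to one. The smoothing constant is an illustrative choice to avoid division by zero; the exact formulation and its per-class treatment are given in Section 3.2.

```python
# Sketch of the Tanimoto-with-complement loss (one reading of Section 3.2;
# the smoothing constant eps is an illustrative safeguard, not a tuned value).
import numpy as np

def tanimoto(p, l, eps=1e-5):
    """Tanimoto coefficient T(p, l) = <p, l> / (|p|^2 + |l|^2 - <p, l>)."""
    tp = np.sum(p * l)
    denom = np.sum(p * p) + np.sum(l * l) - tp
    return (tp + eps) / (denom + eps)

def tanimoto_with_complement_loss(p, l):
    """1 minus the average of T on the masks and on their complements."""
    return 1.0 - 0.5 * (tanimoto(p, l) + tanimoto(1.0 - p, 1.0 - l))

# Applies both to binary masks and to continuous targets in [0, 1],
# e.g. the distance-transform task of the multi-task head.
pred  = np.array([0.9, 0.8, 0.1, 0.2])
label = np.array([1.0, 1.0, 0.0, 0.0])
print(tanimoto_with_complement_loss(pred, label))
```

Because the measure is defined through inner products rather than set cardinalities, it remains well-behaved when the foreground class occupies only a small fraction of the pixels, which is where the complement term contributes most.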

This article is organized as follows. In Section 2 we provide a short review of related work on the topic of semantic segmentation, focused on the field of remote sensing. In Section 3, we detail the model architecture and the modeling framework. Section 4 describes the dataset we used for training our algorithm. In Section 5 we provide an experimental analysis that justifies the design choices of our modeling framework. Finally, Section 6 presents the performance evaluation of our algorithm and a comparison with other published results. Readers are referred to Appendix A for a description of our software implementation and hardware configurations, and to Appendix C for the full error maps on unseen test data.


Related work

The task of semantic segmentation has attracted significant interest in recent years, not only in the computer vision community but also in other disciplines (e.g., biomedical imaging, remote sensing) where automated annotation of images is an important process. In particular, specialized techniques have been developed in different disciplines, since each faces task-specific peculiarities that the computer vision community does not have to address (and vice versa).

Starting from

The ResUNet-a framework

In this section, we introduce the architecture of ResUNet-a in full detail (Section 3.1), a novel loss function designed to achieve faster convergence and higher performance (Section 3.2), our data augmentation methodology (Section 3.3), as well as the methodology we followed for performing inference on large images (Section 3.4). The training strategy and software implementation characteristics can be found in Appendix A.
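As context for Section 3.4, the sketch below shows the generic pattern of windowed inference on images larger than the network input: overlapping tiles are scored independently and the class probabilities are averaged where tiles overlap. Tile size, stride, and the plain averaging rule are illustrative assumptions here, not the exact scheme of Section 3.4.

```python
# Generic sketch of sliding-window inference with overlap averaging.
# `model` is any callable mapping a (C, tile, tile) array to
# (n_classes, tile, tile) class probabilities; tile/stride are illustrative.
import numpy as np

def tiled_inference(model, image, n_classes, tile=256, stride=128):
    """Assumes image is (C, H, W) with H, W >= tile."""
    _, H, W = image.shape
    probs = np.zeros((n_classes, H, W))
    counts = np.zeros((H, W))
    # Regular grid of tile origins, forcing coverage of the borders.
    ys = sorted(set(range(0, H - tile + 1, stride)) | {H - tile})
    xs = sorted(set(range(0, W - tile + 1, stride)) | {W - tile})
    for y in ys:
        for x in xs:
            patch = image[:, y:y + tile, x:x + tile]
            probs[:, y:y + tile, x:x + tile] += model(patch)
            counts[y:y + tile, x:x + tile] += 1.0
    return (probs / counts).argmax(axis=0)  # per-pixel class label map
```

Averaging the overlapping predictions suppresses the artifacts that otherwise appear at tile borders, where the network sees objects with truncated context.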

Data and preprocessing

We sourced data from the ISPRS 2D Semantic Labelling Challenge, and in particular the Potsdam dataset (ISPRS). The data consist of a set of true orthophotos (TOP) extracted from a larger mosaic, and a Digital Surface Model (DSM). The TOP comprises four spectral bands, visible (VIS) red (R), green (G) and blue (B) plus near infrared (NIR), and the ground sampling distance is 5 cm. The normalized DSM layer provides information on the height of each pixel as the ground elevation

Architecture and Tanimoto loss experimental analysis

In this section, we perform an experimental analysis of the ResUNet-a architecture, as well as of the performance of the Tanimoto-with-complement loss function.
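Throughout the experiments, performance is summarized mainly through per-class F1 scores (cf. the average F1 reported in the abstract and Section 6). For reference, a minimal implementation of that metric from a confusion matrix is sketched below; the row/column convention is an assumption of this sketch.

```python
# Per-class F1 from a (K, K) confusion matrix. Convention assumed here:
# rows index the reference labels, columns the predicted labels.
import numpy as np

def per_class_f1(confusion):
    tp = np.diag(confusion).astype(float)              # true positives
    precision = tp / np.maximum(confusion.sum(axis=0), 1)
    recall    = tp / np.maximum(confusion.sum(axis=1), 1)
    return 2 * precision * recall / np.maximum(precision + recall, 1e-12)

cm = np.array([[50,  2,  0],
               [ 3, 45,  1],
               [ 0,  4, 40]])
print(per_class_f1(cm), per_class_f1(cm).mean())       # per-class and average F1
```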

Results and discussion

In this section, we present and discuss the performance of ResUNet-a. We also compare the efficiency of our model with results from architectures of other authors. We present results on both the FoV × 4 and FoV × 1 versions of the ISPRS Potsdam dataset. It should be noted that the ground truth masks of the test set were made publicly available in June 2018. Since then, the ISPRS 2D semantic labelling online test results have not been updated. The ground truth labels used to calculate the

Conclusions

In this work, we present a new deep learning modeling framework for semantic segmentation of high resolution aerial images. The framework consists of a novel multitasking deep learning architecture for semantic segmentation and a new variant of the Dice loss that we term Tanimoto.

Our deep learning architecture, ResUNet-a, is based on the encoder/decoder paradigm, where standard convolutions are replaced with ResNet units that contain multiple parallel atrous convolutions. Pyramid scene

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors acknowledge the support of the Scientific Computing team of CSIRO, and in particular Peter H. Campbell and Ondrej Hlinka. Their contribution was substantial in overcoming many technical difficulties of distributed GPU computing. The authors are also grateful to John Taylor for his help in understanding and implementing distributed optimization using Horovod (Sergeev and Balso, 2018). The authors acknowledge the support of the mxnet community, and in particular Thomas Delteil, Sina

References

  • Zhao, W., et al., 2017. Contextually guided very-high-resolution imagery classification with semantic segments. ISPRS J. Photogramm. Remote Sens.
  • Abraham, N., Khan, N.M., 2018. A novel focal tversky loss function with improved attention u-net for lesion...
  • Audebert, N., et al., 2017. Segment-before-detect: vehicle detection and classification through semantic segmentation of aerial images. Remote Sens.
  • Audebert, N., Saux, B.L., Lefèvre, S., 2016. Semantic segmentation of earth observation data using multimodal and...
  • Baatz, M., Schäpe, A., 2000. Multiresolution segmentation: an optimization approach for high quality multi-scale image...
  • Badrinarayanan, V., Kendall, A., Cipolla, R., 2015. SegNet: A deep convolutional encoder-decoder architecture for image...
  • Bertasius, G., Shi, J., Torresani, L., 2015. Semantic segmentation with boundary neural fields. CoRR abs/1511.02674....
  • Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2016. DeepLab: Semantic image segmentation with deep...
  • Chen, L., Papandreou, G., Schroff, F., Adam, H., 2017. Rethinking atrous convolution for semantic image segmentation....
  • Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z., 2015. MXNet: A flexible...
  • Cheng, G., et al., 2017. Automatic road detection and centerline extraction via cascaded end-to-end convolutional neural network. IEEE Trans. Geosci. Remote Sens.
  • Comaniciu, D., et al., 2002. Mean shift: a robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell.
  • Crum, W.R., et al., 2006. Generalized overlap measures for evaluation and validation in medical image analysis. IEEE Trans. Med. Imaging
  • Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A Large-Scale Hierarchical Image...
  • Dice, L.R., 1945. Measures of the amount of ecologic association between species. Ecology 26, 297–302....
  • Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., Pal, C., 2016. The importance of skip connections in...
  • Everingham, M., et al., 2010. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vision
  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014....
  • Goyal, P., Dollár, P., Girshick, R.B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K., 2017....
  • Gu, Y., et al., 2019. A survey on deep learning-driven remote sensing image scene understanding: scene classification, scene retrieval and scene-guided object detection. Appl. Sci.
  • He, K., Girshick, R.B., Dollár, P., 2018. Rethinking imagenet pre-training. CoRR abs/1811.08883....
  • He, K., Gkioxari, G., Dollár, P., Girshick, R.B., 2017. Mask R-CNN. CoRR abs/1703.06870....
  • He, K., Zhang, X., Ren, S., Sun, J., 2014. Spatial pyramid pooling in deep convolutional networks for visual...
  • He, K., Zhang, X., Ren, S., Sun, J., 2015. Deep residual learning for image recognition. CoRR abs/1512.03385....
  • He, K., Zhang, X., Ren, S., Sun, J., 2016. Identity mappings in deep residual networks. CoRR abs/1603.05027....
  • Huang, G., Liu, Z., Weinberger, K.Q., 2016. Densely connected convolutional networks. CoRR abs/1608.06993....
  • Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate...
  • ISPRS, International Society for Photogrammetry and Remote Sensing (ISPRS) and BSF Swissphoto: WG3 Potsdam overhead...
  • Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K., 2015. Spatial transformer networks. CoRR abs/1506.02025....
  • Kervadec, H., Bouchtiba, J., Desrosiers, C., Granger, É., Dolz, J., Ayed, I.B., 2018. Boundary loss for highly...
  • Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980....
  • Lambert, M.J., et al., 2016. Cropland mapping over Sahelian and Sudanian agrosystems: a knowledge-based approach using PROBA-V time series at 100-m. Remote Sens.
  • Längkvist, M., et al., 2016. Classification and segmentation of satellite orthoimagery using convolutional neural networks. Remote Sens.