Deep learning-based remote and social sensing data fusion for urban region function recognition

https://doi.org/10.1016/j.isprsjprs.2020.02.014

Abstract

Urban region function recognition is key to rational urban planning and management. Due to the complex socioeconomic nature of functional land use, recognizing urban region functions in high-density cities using remote sensing images alone is difficult. The inclusion of social sensing has the potential to improve the function classification performance. However, effectively integrating the multi-source and multi-modal remote and social sensing data remains technically challenging. In this paper, we propose a novel end-to-end deep learning-based remote and social sensing data fusion model to address this issue. Two neural network-based methods, one based on a 1-dimensional convolutional neural network (CNN) and the other based on a long short-term memory (LSTM) network, have been developed to automatically extract discriminative time-dependent social sensing signature features, which are fused with remote sensing image features extracted via a residual neural network. One of the major difficulties in exploiting social and remote sensing data is that the two data sources are asynchronous. We have developed a deep learning-based strategy to address this missing modality problem by enforcing cross-modal feature consistency (CMFC) and cross-modal triplet (CMT) constraints. We train the model in an end-to-end manner by simultaneously optimizing three costs, namely the classification cost, the CMFC cost, and the CMT cost. Extensive experiments have been conducted on publicly available datasets to demonstrate the effectiveness of the proposed method in fusing remote and social sensing data for urban region function recognition. The results show that the seemingly unrelated physically sensed image data and socially sensed activity signatures can indeed complement each other to enhance the accuracy of urban region function recognition.

Introduction

Nowadays, more than half of the world's population resides in cities, which cover less than 2% of the Earth's surface. With rapid urbanization underway in Asia and Africa, the global urban population is still expanding and is estimated to reach 5 billion by 2030 (Tu et al., 2018). It is therefore of great importance to monitor and manage the limited urban areas that accommodate such a huge population.

Urban region function recognition is key to rational urban planning and management. It refers to the inference of the usage purposes of urban regions directly associated with human activities, such as residential, commercial, entertainment, and educational uses (Zhang et al., 2017, Zhang et al., 2019). It is related to, but also different from, traditional land use and land cover (LULC) classification; the latter usually emphasizes the physical characteristics of the Earth's surface, while the former focuses purely on the socioeconomic functional attributes of urban regions. LULC monitoring using remote sensing imagery has proven efficient and effective, since such images capture the natural appearance of the land surface well. However, recognizing region functions using remote sensing images alone is not sufficient, especially in high-density cities such as Shenzhen, London, and New York. This is due to the following facts: (1) urban region functions are socioeconomic in nature and determined by the related human activities; (2) the shadows of numerous high-rise buildings in high-density cities pose great challenges for remote sensing image processing; (3) mixed urban functions are often clustered in one building or block in East Asian cities.

With the rapid development of information and communication technologies (ICTs), social sensing big data that record human dynamics are becoming increasingly available, such as vehicle GPS trajectories (Tu et al., 2018, Liu et al., 2012), points of interest (POI) (Liu et al., 2017, Hu et al., 2016), mobile phone positioning data (Jia et al., 2018, Tu et al., 2017), social media check-in data (Tu et al., 2017, Gao et al., 2017), and geotagged photos (Cao et al., 2018, Zhu and Newsam, 2015). Different from remote sensing images, these social sensing data are by-products of human daily life and therefore contain rich socioeconomic attributes. When such data meet remote sensing, a promising direction is to fuse them to recognize urban functions, since the two kinds of data are complementary to each other (Liu et al., 2015). However, remote and social sensing data differ significantly in terms of sources and modalities. Generally, remote sensing images cover the whole study area, whereas social sensing data are place-based and thus represented by points, polylines, or polygons. Besides, the features of social sensing data may be time-based rather than space-based (Tu et al., 2017). The fusion of these multi-source and multi-modal data is non-trivial; the key challenge is to alleviate the modality gap and the heterogeneity between them.

The emergence of deep learning has advanced many research fields, including image recognition (LeCun et al., 2015) and time series classification (Fawaz et al., 2019). It has also greatly boosted the development of remote sensing (Zhu et al., 2017, Zhang et al., 2016). Significant improvements have been made in many tasks, such as hyperspectral image analysis (Li et al., 2019), image scene classification (Cheng et al., 2017), semantic labeling (Audebert et al., 2018), object detection (Cheng and Han, 2016), and image retrieval (Cao et al., 2020). The major advantage of deep learning approaches is their powerful ability to automatically learn high-level features from large amounts of data, which is vital for bridging the gap between different data modalities at the feature level. Deep learning-based fusion methods therefore hold great potential for integrating the multi-source and multi-modal remote and social sensing data.

In this paper, to address the problem of urban region function recognition with cross-modal data sources, we propose an end-to-end deep learning-based multi-modal data fusion method that integrates remote sensing images and social sensing signatures. The two kinds of data are first fed into modal-specific encoders, a residual convolutional neural network (CNN) and our proposed 1-d CNN/LSTM-based network respectively, to extract effective features; these features are then fused, and the outputs are finally passed through fully connected layers and a softmax layer to make predictions (a minimal sketch of this pipeline is given after the contribution list below). We also propose two auxiliary losses that further constrain the network training by drawing the two extracted multi-modal features closer together, which ensures robustness against missing modalities. Publicly available datasets are exploited to evaluate our methods, and the results demonstrate their effectiveness and efficiency. In addition, a thorough analysis of the methods and results has been conducted to provide insights into fusing the two types of data. Our contributions are summarized as follows:

  • 1.

    We propose an end-to-end deep multi-modal fusion method to effectively integrate remote sensing images and social sensing signatures for urban region function recognition.

  • 2.

    We propose two effective neural networks to automatically extract temporal signature features. One is 1-d CNN-based and the other is LSTM-based; both explicitly take temporal dependencies into account and effectively extract sequential-aware features.

  • 3.

    To address the data asynchrony problem, we propose two auxiliary losses, i.e. the cross-modal feature consistency (CMFC) loss and the cross-modal triplet (CMT) loss, to make the proposed multi-modal fusion network more robust to missing modalities without significantly impacting performance.

  • 4.

    We have conducted extensive experiments on publicly available datasets to evaluate the effectiveness and efficiency of the proposed methods. We also analyze and discuss the results thoroughly to provide insights into fusing the two modalities.
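To make the overall architecture easier to picture, the sketch below shows one plausible way to wire the two modal-specific encoders and the fused classifier described above, in PyTorch-style pseudocode. The backbone choice (ResNet-18), feature dimensions, and layer configuration are illustrative assumptions, not the exact settings used in the paper.

```python
# Illustrative sketch of the two-branch fusion network outlined above.
# ResNet-18, feature sizes, and the (channels, time) signature layout are
# assumptions for demonstration; an LSTM branch could replace the 1-D CNN.
import torch
import torch.nn as nn
import torchvision.models as models


class TemporalSignatureEncoder(nn.Module):
    """1-D CNN branch for the social sensing temporal signature."""

    def __init__(self, in_channels: int, feat_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),            # pool over the time axis
        )
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, x):                        # x: (B, C, T)
        return self.fc(self.conv(x).squeeze(-1))


class FusionNet(nn.Module):
    """Residual CNN image branch + 1-D CNN signature branch + fused classifier."""

    def __init__(self, ts_channels: int, num_classes: int = 9, feat_dim: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=None)  # residual image encoder
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)
        self.image_encoder = backbone
        self.ts_encoder = TemporalSignatureEncoder(ts_channels, feat_dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, num_classes),      # softmax applied in the loss
        )

    def forward(self, image, signature):
        f_img = self.image_encoder(image)          # (B, feat_dim)
        f_ts = self.ts_encoder(signature)          # (B, feat_dim)
        logits = self.classifier(torch.cat([f_img, f_ts], dim=1))
        return logits, f_img, f_ts
```

Returning the two intermediate features alongside the logits would allow the CMFC and CMT constraints of contribution 3 to be applied on top of the classification loss; a corresponding sketch of such a joint objective is given in the Methodology section below.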

The rest of the paper is organized as follows. In Section 2, we review related work on remote and social sensing for LULC classification and urban function recognition. In Section 3, we illustrate and formulate the problem of integrating remote sensing images and social sensing signatures for urban region function recognition. Section 4 elaborates on the proposed methods for extracting image and temporal signature features, as well as the multi-modal deep learning fusion for urban region function classification. Section 5 evaluates our methods on publicly available datasets and analyzes the results. In Section 6, we discuss several important issues concerning the strengths and possible improvements of the proposed methods. Finally, we conclude in Section 7.

Section snippets

Remote sensing for LULC classification

Land use and land cover classification from remote sensing imagery is a fundamental research topic in the remote sensing community. Due to the limited spatial resolution of optical remote sensing imagery, pixel-centric, spectral-based methods have been the mainstream of traditional LULC classification work (Blaschke et al., 2014). However, the rapid development of high spatial resolution remote sensing imagery brings opportunities to dig into more complex spatial patterns, and geographic …

Problem statement

The problem of urban region function recognition from multi-source and multi-modal remote sensing imagery and social sensing signature data is illustrated in Fig. 1. It can be formally defined as follows: for a region $R_i$, given the satellite imagery $I_i$ and the social sensing signature $S_i$ of the region, the class $c_i$ that $R_i$ belongs to is to be predicted, as formulated in Eq. (1):

$c_i = \arg\max_{k} \, p(c_k \mid I_i, S_i), \quad (1)$

where $c_i \in \mathcal{C} = \{c_k \mid k = 1, 2, \ldots, C\}$ and $C$ is the number of function types. Common functions include …
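To connect Eq. (1) with the network output, the following minimal snippet illustrates the decision rule in practice: a softmax over the classifier's logits plays the role of $p(c_k \mid I_i, S_i)$, and the predicted function type is simply the argmax. The variable names are illustrative only.

```python
# Minimal illustration of the decision rule in Eq. (1).
# `logits` would come from the classification head of the fusion network;
# here a random tensor stands in for it.
import torch
import torch.nn.functional as F

logits = torch.randn(1, 9)                    # scores for the 9 function types
probs = F.softmax(logits, dim=1)              # approximates p(c_k | I_i, S_i)
predicted = probs.argmax(dim=1).item()        # c_i = argmax_k p(c_k | I_i, S_i)
```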

Methodology

In this section, we describe the details of the proposed deep multi-modal fusion network, which is capable of integrating remote sensing images and social sensing temporal signature (TS) data for urban region function recognition.
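Before the detailed description, the following is a hedged sketch of how the three training costs mentioned in the abstract (classification, CMFC, and CMT) could be combined into a single objective. The concrete forms used here, an MSE-based consistency term, a margin-based triplet term with simple in-batch negatives, and weights lambda_1 and lambda_2, are assumptions for illustration only; the paper's exact formulations are elaborated later in this section.

```python
# Hedged sketch of a joint training objective: classification cost plus the
# cross-modal feature consistency (CMFC) and cross-modal triplet (CMT) terms.
# The specific loss forms and weights below are illustrative assumptions.
import torch
import torch.nn.functional as F


def joint_loss(logits, labels, f_img, f_ts, lambda_1=1.0, lambda_2=1.0, margin=1.0):
    # (1) standard cross-entropy classification cost on the fused prediction
    cls_loss = F.cross_entropy(logits, labels)

    # (2) CMFC: pull the two modal features of the same region together
    cmfc_loss = F.mse_loss(f_img, f_ts)

    # (3) CMT: an image feature should be closer to the signature feature of
    # its own region (positive) than to that of another region (negative)
    negative = f_ts.roll(shifts=1, dims=0)     # simple in-batch negative sampling
    cmt_loss = F.triplet_margin_loss(anchor=f_img, positive=f_ts,
                                     negative=negative, margin=margin)

    return cls_loss + lambda_1 * cmfc_loss + lambda_2 * cmt_loss
```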

Datasets

In this paper, the Urban Region Function Classification (URFC) datasets are used to evaluate the proposed methods. The datasets are collected from urban areas in China. Two kinds of data are provided, i.e. satellite images and user visit data, with labels of 9 categories, i.e. residential area (res.), school (sch.), industrial park (ind.), railway station (rail.), airport (air.), park (park), shopping area (shop.), administrative district (adm.), …

Capacity for sequential-aware modeling

Temporal dependency is one of the most important properties of the temporal signature data, reflecting the vital role of the sequential order of the data. Taking full advantage of this information is significantly important for time-series data classification. In order to demonstrate the sequential-aware modeling capability of the proposed 1-d SPP-Net and LSTM-Net, extra experiments have been conducted. The sequential order of the input temporal signatures is randomly shuffled, and …
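As a concrete illustration of this shuffling ablation, the snippet below randomly permutes the time axis of a batch of temporal signatures before they are fed to the signature encoder, destroying the sequential order while keeping the values intact. The (batch, channels, time) tensor layout is an assumption about how the signatures are stored.

```python
# Illustrative sketch of the order-shuffling ablation: permute the time axis
# of the temporal signatures so any sequential structure is destroyed, then
# compare classification accuracy against the unshuffled input.
# The (batch, channels, time) layout is an assumption.
import torch


def shuffle_time_axis(signatures: torch.Tensor) -> torch.Tensor:
    """Randomly permute the time steps of a (B, C, T) signature batch."""
    perm = torch.randperm(signatures.size(-1))
    return signatures[..., perm]
```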

Conclusions

Remote and social sensing data are complementary to each other, as each possesses its own unique characteristics. Their integration has the potential to improve the accuracy of urban region function recognition. The key challenge is the effective fusion of the two kinds of data. In this paper, we propose an end-to-end deep multi-modal fusion network to effectively fuse satellite imagery and social sensing signature data. The two data sources are put into modal-specific encoders of residual CNN …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

The author acknowledges the financial support from the International Doctoral Innovation Centre, Ningbo Education Bureau, Ningbo Science and Technology Bureau, and the University of Nottingham. This work was also supported by the UK Engineering and Physical Sciences Research Council [Grant No. EP/L015463/1], the National Natural Science Foundation of China (No. 41871329, 71961137003), the Shenzhen Scientific Research and Development Funding Program (No. JCYJ20170818092931604,

References (60)

  • W. Zhang et al., Parcel-based urban land use classification in megacity using airborne LiDAR, high resolution orthoimagery, and Google Street View, Comput. Environ. Urban Syst. (2017).
  • C. Zhang et al., An object-based convolutional neural network (OCNN) for urban land use classification, Remote Sens. Environ. (2018).
  • Y. Zhang et al., Functional urban land use recognition integrating multi-source geospatial data and cross-correlations, Comput. Environ. Urban Syst. (2019).
  • C. Zhang et al., Joint deep learning for land cover and land use classification, Remote Sens. Environ. (2019).
  • F. Zhang et al., Social sensing from street-level imagery: A case study in learning spatio-temporal urban mobility patterns, ISPRS J. Photogramm. Remote Sens. (2019).
  • Albert, A., Kaur, J., Gonzalez, M.C., 2017. Using convolutional networks and satellite imagery to identify patterns in...
  • Cao, R., Qiu, G., 2018. Urban land use classification based on aerial and ground images. In: Proceedings of the 16th...
  • Cao, J., Tu, W., Li, Q., Zhou, M., Cao, R., 2015. Exploring the distribution and dynamics of functional regions using...
  • R. Cao et al., Integrating aerial and street view images for urban land use classification, Remote Sens. (2018).
  • R. Cao et al., Enhancing remote sensing image retrieval using a triplet deep metric learning network, Int. J. Remote Sens. (2020).
  • G. Cheng et al., Remote sensing image scene classification: Benchmark and state of the art, Proc. IEEE (2017).
  • M. Chi et al., A novel methodology to label urban remote sensing images based on location-based social media photos, Proc. IEEE (2017).
  • Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image...
  • X. Deng et al., Geospatial big data: new paradigm of remote sensing applications, IEEE J. Sel. Top. Appl. Earth Obser. Remote Sens. (2019).
  • Du, Z., Zhang, X., Li, W., Zhang, F., Liu, R., 2019. A multi-modal transportation data-driven approach to identify...
  • H.I. Fawaz et al., Deep learning for time series classification: a review, Data Min. Knowl. Disc. (2019).
  • T. Feng et al., Urban zoning using higher-order Markov random fields on multi-view imagery data.
  • S. Gao et al., Extracting urban functional regions from points of interest and human activities on location-based social networks, Trans. GIS (2017).
  • Q. Gao et al., Identification of urban regions’ functions in Chengdu, China, based on vehicle trajectory data, PLOS One (2019).
  • P. Ghamisi et al., Multisource and multitemporal data fusion in remote sensing: a comprehensive review of the state of the art, IEEE Geosci. Remote Sens. Mag. (2019).