Deep learning-based remote and social sensing data fusion for urban region function recognition
Introduction
Nowadays, more than half of the world's population resides in cities, which cover less than 2% of the Earth's surface. As rapid urbanization proceeds in Asia and Africa, the global urban population continues to expand and is estimated to reach 5 billion by 2030 (Tu et al., 2018). It is therefore of great importance to monitor and manage the limited urban areas that accommodate such a huge population.
Urban region function recognition is key to rational urban planning and management. It refers to inferring the usage purposes of urban regions that are directly associated with human activities, such as residential, commercial, entertainment, and educational uses (Zhang et al., 2017, Zhang et al., 2019). It is related to but distinct from traditional land use and land cover (LULC) classification; the latter usually stresses the physical characteristics of the Earth's surface, while the former focuses on the socioeconomic functional attributes of urban regions. LULC monitoring using remote sensing imagery has proven efficient and effective, since such images capture the natural appearance of the land surface well. However, remote sensing images alone are not sufficient for region function recognition, especially in high-density cities such as Shenzhen, London, and New York. This is due to the following facts: (1) urban region functions are socioeconomic properties determined by the related human activities; (2) the shadows of numerous high-rise buildings in high-density cities pose great challenges for remote sensing image processing; and (3) mixed urban functions are often clustered within one building or block in East Asian cities.
With the rapid development of information and communication technologies (ICTs), social sensing big data recording human dynamics are becoming increasingly available, such as vehicle GPS trajectories (Tu et al., 2018, Liu et al., 2012), points of interest (POI) (Liu et al., 2017, Hu et al., 2016), mobile phone positioning data (Jia et al., 2018, Tu et al., 2017), social media check-in data (Tu et al., 2017, Gao et al., 2017), and geotagged photos (Cao et al., 2018, Zhu and Newsam, 2015). Unlike remote sensing images, these social sensing data are by-products of human daily life and therefore contain rich socioeconomic attributes. Fusing the two kinds of data to recognize urban functions is a promising direction, since they are complementary to each other (Liu et al., 2015). However, remote and social sensing data differ significantly in sources and modalities. Remote sensing images continuously cover the study area, while social sensing data are place-based and thus represented by points, polylines, or polygons. Moreover, the features of social sensing data may be time-based (Tu et al., 2017) rather than space-based. Fusing these multi-source and multi-modal data is non-trivial; the key challenge is to alleviate the modality gap and heterogeneity between them.
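To make the time-based representation concrete, the following minimal Python sketch (our own illustration; the day-of-week by hour-of-day aggregation and the function name are assumptions, not the paper's actual preprocessing) shows how timestamped visit records of one region could be turned into a fixed-size temporal signature that a network can consume alongside an image:

```python
import numpy as np

def temporal_signature(timestamps):
    """Aggregate visit timestamps of one region into a 7x24
    day-of-week x hour-of-day count matrix (one possible signature)."""
    sig = np.zeros((7, 24), dtype=np.float32)
    for ts in timestamps:  # ts: datetime.datetime of a single visit
        sig[ts.weekday(), ts.hour] += 1.0
    total = sig.sum()
    return sig / total if total > 0 else sig  # normalize to a distribution
```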
The emergence of deep learning has advanced many research fields, including image recognition (LeCun et al., 2015) and time series classification (Fawaz et al., 2019), among others. It has also greatly boosted the development of remote sensing (Zhu et al., 2017, Zhang et al., 2016), with significant improvements in many tasks, such as hyperspectral image analysis (Li et al., 2019), image scene classification (Cheng et al., 2017), semantic labeling (Audebert et al., 2018), object detection (Cheng and Han, 2016), and image retrieval (Cao et al., 2020). The major advantage of deep learning approaches is their powerful ability to automatically learn high-level features from large amounts of data, which is vital for bridging the gap between different data modalities at the feature level. Deep learning-based fusion methods therefore hold great potential for integrating multi-source and multi-modal remote and social sensing data.
In this paper, to address the problem of urban region function recognition with cross-modal data sources, we propose an end-to-end deep learning-based multi-modal data fusion method to integrate remote sensing images and social sensing signatures. The two kinds of data are first fed into modal-specific encoders, a residual convolutional neural network (CNN) and our proposed 1d CNN/LSTM-based network respectively, to extract effective features; these features are then fused, and the outputs are finally passed through fully connected layers and a softmax layer to make predictions. We also propose two ancillary losses that further constrain the network training by drawing the two extracted multi-modal features nearer, which ensures robustness against missing modalities. Openly available datasets are exploited to evaluate our methods, and the results demonstrate their effectiveness and efficiency. In addition, a thorough analysis of the methods and results provides insights into fusing the two types of data. Our contributions are summarized as follows:
1. We propose an end-to-end deep multi-modal fusion method to effectively incorporate multi-modal remote sensing images and social sensing signatures for urban region function recognition.
2. We propose two effective neural networks to extract temporal signature features automatically. One is 1d CNN-based and the other is LSTM-based; both explicitly take temporal dependencies into account and effectively extract sequence-aware features.
3. To address the data asynchrony problem, we propose two auxiliary losses, i.e. the cross-modal feature consistency (CMFC) loss and the cross-modal triplet (CMT) loss, which make the proposed multi-modal fusion network more robust to missing modalities without significantly impacting performance (a minimal sketch of both losses follows this list).
4. We have conducted extensive experiments on openly available datasets to evaluate the effectiveness and efficiency of the proposed methods. We also analyze and discuss the results thoroughly to give insights into fusing the two data modalities.
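As referenced from contribution 3 above, the following PyTorch sketch gives one plausible form of the two auxiliary losses. The exact formulations appear in Section 4, so the distance choices, margin value, and batch-rolling negative sampling here are assumptions:

```python
import torch.nn.functional as F

def cmfc_loss(img_feat, sig_feat):
    """Cross-modal feature consistency (CMFC): pull the image and
    signature embeddings of the same region together (MSE assumed)."""
    return F.mse_loss(img_feat, sig_feat)

def cmt_loss(img_feat, sig_feat, margin=1.0):
    """Cross-modal triplet (CMT): an image embedding should lie closer
    to its own region's signature embedding than to another region's.
    Negatives come from rolling the batch by one (assumed strategy)."""
    pos = F.pairwise_distance(img_feat, sig_feat)
    neg = F.pairwise_distance(img_feat, sig_feat.roll(1, dims=0))
    return F.relu(pos - neg + margin).mean()
```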
The rest of the paper is organized as follows. In Section 2, we review related work on remote and social sensing for LULC classification and urban function recognition. In Section 3, we illustrate and formulate the problem of integrating remote sensing images and social sensing signatures for urban region function recognition. Section 4 elaborates the proposed methods to extract image and temporal signature features, as well as the multi-modal deep learning fusion for urban region function classification. Section 5 evaluates our methods on publicly available datasets and analyzes the results. In Section 6, we discuss several important issues concerning the strengths and possible improvements of the proposed methods. Finally, we conclude in Section 7.
Remote sensing for LULC classification
Land use and land cover classification from remote sensing imagery is a fundamental research topic in the remote sensing community. Due to the limited spatial resolution of optical remote sensing imagery, pixel-centric spectral-based methods have been the mainstream of traditional LULC classification work (Blaschke et al., 2014). However, the rapid development of high spatial resolution remote sensing imagery brings opportunities for digging into more complex spatial patterns, and geographic …
Problem statement
The problem of urban region function recognition from multi-source and multi-modal remote sensing imagery and social sensing signature data is illustrated in Fig. 1. It can be formally defined as follows: for a region $r$, given the satellite imagery $I_r$ and the social sensing signature $S_r$ of the region, the class $c_r$ that $r$ belongs to is to be predicted, as formulated in Eq. (1):

$$c_r = \mathcal{F}(I_r, S_r), \quad c_r \in \{1, 2, \dots, C\} \tag{1}$$

where $C$ is the number of function types. Common functions include …
Methodology
In this section, we describe the details of the proposed deep multi-modal fusion network, which is capable of integrating remote sensing images and social sensing temporal signature (TS) data for urban region function recognition.
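Before the detailed description, a minimal PyTorch sketch of the overall structure outlined in the introduction may help orientation: a residual CNN encodes the image, a small 1d CNN encodes the temporal signature, and the concatenated features pass through fully connected layers to produce class scores. The backbone choice (ResNet-18), layer sizes, and signature shape are assumptions for illustration, not the exact configuration of Section 4.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FusionNet(nn.Module):
    """Two modal-specific encoders followed by feature-level fusion."""
    def __init__(self, sig_channels=7, n_classes=9, feat_dim=256):
        super().__init__()
        # Image branch: residual CNN backbone (ResNet-18 assumed here).
        backbone = resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)
        self.img_encoder = backbone
        # Signature branch: a small 1d CNN over the temporal axis.
        self.sig_encoder = nn.Sequential(
            nn.Conv1d(sig_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(128, feat_dim),
        )
        # Fusion head: concatenation, then fully connected layers.
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, n_classes),
        )

    def forward(self, image, signature):
        f_img = self.img_encoder(image)      # (B, feat_dim)
        f_sig = self.sig_encoder(signature)  # (B, feat_dim)
        logits = self.classifier(torch.cat([f_img, f_sig], dim=1))
        return logits, f_img, f_sig  # features reused by auxiliary losses
```

During training, a softmax cross-entropy loss on the logits would be combined with the auxiliary CMFC and CMT losses sketched earlier, which reuse the two returned feature vectors.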
Datasets
In this paper, the Urban Region Function Classification (URFC) datasets are used to evaluate the proposed methods. The datasets are collected from urban areas in China. Two kinds of data are provided, i.e. satellite images and user visit data, with labels of 9 categories, i.e. residential area (res.), school (sch.), industrial park (ind.), railway station (rail.), airport (air.), park (park), shopping area (shop.), administrative district (adm.), and hospital (hos.).
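The following `Dataset` sketch shows how paired samples might be assembled for the fusion network; the preprocessed .npy layout, array shapes, and file names are hypothetical, so it would need adapting to the actual URFC release format:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class URFCPairs(Dataset):
    """Pairs one satellite image with one visit signature per region.
    Assumes preprocessed .npy arrays; adapt to the real URFC layout."""
    def __init__(self, images_path, signatures_path, labels_path):
        self.images = np.load(images_path)          # (N, 3, H, W), float32
        self.signatures = np.load(signatures_path)  # (N, 7, 24), float32
        self.labels = np.load(labels_path)          # (N,), int64 in [0, 8]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return (torch.from_numpy(self.images[i]),
                torch.from_numpy(self.signatures[i]),
                int(self.labels[i]))
```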
Capacity for sequential-aware modeling
Temporal dependency is one of the most important properties of temporal signature data, reflecting the vital role of the sequential order of the data, and taking full advantage of this information is important for time-series classification. To demonstrate the sequential-aware modeling capability of the proposed 1-d SPP-Net and LSTM-Net, extra experiments have been conducted: the sequential order of the input temporal signatures is randomly shuffled, and …
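This shuffling probe is straightforward to reproduce; a minimal sketch, assuming batched signatures shaped (batch, channels, time) as in the earlier sketches:

```python
import torch

def shuffle_time_axis(signature):
    """Randomly permute the temporal axis of a batch of signatures,
    destroying sequential order while keeping the value distribution."""
    perm = torch.randperm(signature.size(-1))
    return signature[..., perm]

# Evaluating a trained model on shuffled inputs: a large accuracy drop
# relative to ordered inputs indicates sequential-aware feature learning.
```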
Conclusions
Remote and social sensing data are complementary to each other, as each possesses its own unique characteristics, and integrating them has the potential to improve the accuracy of urban region function recognition. The key challenge is the effective fusion of the two kinds of data. In this paper, we propose an end-to-end deep multi-modal fusion network to effectively fuse satellite imagery and social sensing signature data. The two data sources are put into modal-specific encoders, a residual CNN and the proposed 1d CNN/LSTM-based network respectively, to extract features, which are then fused to make the final prediction.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
The author acknowledges the financial support from the International Doctoral Innovation Centre, Ningbo Education Bureau, Ningbo Science and Technology Bureau, and the University of Nottingham. This work was also supported by the UK Engineering and Physical Sciences Research Council [Grant No. EP/L015463/1], the National Natural Science Foundation of China (No. 41871329, 71961137003), and the Shenzhen Scientific Research and Development Funding Program (No. JCYJ20170818092931604, …).
References (60)
- Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogramm. Remote Sens. (2018)
- Geographic object-based image analysis – towards a new paradigm. ISPRS J. Photogramm. Remote Sens. (2014)
- Social functional mapping of urban green space using remote sensing and social sensing data. ISPRS J. Photogramm. Remote Sens. (2018)
- A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. (2016)
- Putting people in the picture: Combining big location-based social media data and remote sensing imagery for enhanced contextual urban information in Shanghai. Comput. Environ. Urban Syst. (2017)
- Urban land uses and traffic ‘source-sink areas’: Evidence from GPS-enabled taxi data in Shanghai. Landscape Urban Plann. (2012)
- Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS J. Photogramm. Remote Sens. (2018)
- Understanding urban landuse from the above and ground perspectives: A deep learning, multimodal solution. Remote Sens. Environ. (2019)
- Spatial variations in urban public ridership derived from GPS trajectories and smart card data. J. Transp. Geogr. (2018)
- Hierarchical semantic cognition for urban functional zones with VHR satellite images and POI data. ISPRS J. Photogramm. Remote Sens. (2017)
- Parcel-based urban land use classification in megacity using airborne LiDAR, high resolution orthoimagery, and Google Street View. Comput. Environ. Urban Syst.
- An object-based convolutional neural network (OCNN) for urban land use classification. Remote Sens. Environ.
- Functional urban land use recognition integrating multi-source geospatial data and cross-correlations. Comput. Environ. Urban Syst.
- Joint deep learning for land cover and land use classification. Remote Sens. Environ.
- Social sensing from street-level imagery: A case study in learning spatio-temporal urban mobility patterns. ISPRS J. Photogramm. Remote Sens.
- Integrating aerial and street view images for urban land use classification. Remote Sens.
- Enhancing remote sensing image retrieval using a triplet deep metric learning network. Int. J. Remote Sens.
- Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE
- A novel methodology to label urban remote sensing images based on location-based social media photos. Proc. IEEE
- Geospatial big data: new paradigm of remote sensing applications. IEEE J. Sel. Top. Appl. Earth Obser. Remote Sens.
- Deep learning for time series classification: a review. Data Min. Knowl. Disc.
- Urban zoning using higher-order Markov random fields on multi-view imagery data
- Extracting urban functional regions from points of interest and human activities on location-based social networks. Trans. GIS
- Identification of urban regions’ functions in Chengdu, China, based on vehicle trajectory data. PLOS One
- Multisource and multitemporal data fusion in remote sensing: a comprehensive review of the state of the art. IEEE Geosci. Remote Sens. Mag.