Exploration of OpenStreetMap missing built-up areas using Twitter hierarchical clustering and deep learning in Mozambique
Introduction
Over the last decades, Volunteered Geographic Information (VGI) has been collected in a more detailed, dynamic, and manifold fashion than ever before from heterogeneous data sources, such as location-based services, global positioning systems (GPS), high-resolution earth observation data, and crowdsourced geographic information (Goodchild, 2007). OpenStreetMap (OSM) is considered the most active and widely used VGI platform. However, its reliability and accessibility remain variable due to the high diversity of volunteers’ mapping behavior (Barron et al., 2014). Data quality is regarded as the first topic that suggests itself to anyone encountering VGI for the very first time (Goodchild and Glennon, 2010). Therefore, exploring the quality and accessibility of OSM data requires further research towards developing sophisticated methods that integrate multiple social and geographical perspectives. Better quality-oriented awareness is of central importance for improving data quality and boosting the application of OSM data in general.
Existing work on the quality of OSM data falls mainly into two streams. One common approach is to compare OSM data with authoritative reference data sets (Fan et al., 2014; Zielstra et al., 2013; Neis et al., 2012; Mooney and Corcoran, 2012), which are collected by federal agencies or commercial map providers. However, the acquisition of such reference data sets depends heavily on socio-economic factors (e.g., time, costs, and human labor restrictions), which limits the application of this extrinsic analysis approach. As an alternative, intrinsic data analysis looks into the historical data, where intrinsic indicators show great potential as alternate indicators of OSM data quality (Barron et al., 2014; Zhang et al., 2018; Jackson et al., 2013; Ostermann and Spinsanti, 2011). In a data-sparse scenario where most settlement and street features are simply missing from OSM, these established approaches are no longer adequate, because neither reference data nor historical data are available. Therefore, robust and efficient quality indicators are necessary, and they should be easy to generate from widely available open geospatial data.
With the rapidly growing need for disaster response worldwide, we have witnessed increasing demand for accurate geographical information on the spatial distribution of human settlements. Examples include the 2008 Wenchuan earthquake, the 2010 Haiti earthquake, and the 2019 Cyclones Idai and Kenneth in Mozambique, which all caused tremendous damage, injuries, and loss of human lives. VGI, especially OpenStreetMap, has opened a new window in supporting such disaster response by establishing humanitarian mapping projects, with the motto of “mapping the most vulnerable places in the world” (Scholz et al., 2018). When considering the quality of OSM data, we should keep in mind that quality may have diverse meanings, depending on the application to which the information is to be put (Goodchild and Glennon, 2010). In a disaster response mapping scenario, the first priority is to map as many built-up areas with potential human settlements as possible, since these areas could be vulnerable to the disaster. In other words, as a dimension of quality, completeness often outweighs positional accuracy. To indicate the overall completeness of OSM data with a focus on human settlements, we dedicate this work to the exploration of OSM missing built-up areas by integrating remote sensing and social sensing (Liu et al., 2015b) perspectives.
In this paper, we explore how social media and earth observation data can be used as reliable alternate sources to estimate OSM data quality in terms of missing built-up areas. To this end, this paper proposes a novel method that exploits the complementary value of hierarchical clustering of geo-tagged tweets and deep learning based built-up area mapping for large-scale OSM data quality indication. By implementing the proposed method in Mozambique, Africa, we successfully explored a range of OSM missing built-up areas, which deserve future detailed mapping by volunteers. The research questions answered in this paper are twofold.
- (RQ1): How can we discover the relationship between human active regions and geo-tagged tweets clusters at diverse scales of density, shape, and random cluster number?
- (RQ2): How can we further estimate and map OSM missing built-up areas within the discovered regions even with little prior knowledge?
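To make RQ2 more concrete, the following is a minimal sketch of how missing built-up areas could be flagged once a human active region and a built-up prediction are available. It is not the paper's implementation: the grid-cell representation, the function names `missing_built_up_cells` and `osm_completeness`, and the probability threshold of 0.5 are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical representation: each grid cell is identified by (row, col).
Cell = Tuple[int, int]


def predicted_built_up(
    predicted_prob: Dict[Cell, float],
    active_region: Set[Cell],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Set[Cell]:
    """Cells inside the active region whose predicted probability passes the threshold."""
    return {c for c, p in predicted_prob.items() if p >= threshold} & active_region


def missing_built_up_cells(
    predicted_prob: Dict[Cell, float],
    osm_built_up: Set[Cell],
    active_region: Set[Cell],
    threshold: float = 0.5,
) -> Set[Cell]:
    """Cells predicted as built-up inside the active region but not mapped in OSM."""
    return predicted_built_up(predicted_prob, active_region, threshold) - osm_built_up


def osm_completeness(
    predicted_prob: Dict[Cell, float],
    osm_built_up: Set[Cell],
    active_region: Set[Cell],
    threshold: float = 0.5,
) -> Optional[float]:
    """Share of predicted built-up cells that OSM already covers (None if nothing is predicted)."""
    predicted = predicted_built_up(predicted_prob, active_region, threshold)
    if not predicted:
        return None
    return len(predicted & osm_built_up) / len(predicted)


if __name__ == "__main__":
    # Toy values for illustration only; they are not results from the study.
    prob = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.2, (1, 1): 0.7, (2, 2): 0.95}
    osm = {(0, 0)}
    region = {(0, 0), (0, 1), (1, 0), (1, 1)}
    print(sorted(missing_built_up_cells(prob, osm, region)))
    print(osm_completeness(prob, osm, region))
```

Restricting the comparison to the active region keeps the estimate focused on places where human activity is expected, which matches the priority on completeness over positional accuracy described above.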
The remainder of this paper is organized as follows: Section 2 introduces the relevant work of assessing OSM data quality, summarizing state-of-the-art studies. The overall methodology is presented in Section 3, followed by Section 4 that evaluates the performance of our OSM missing built-up areas mapping method in Mozambique. Section 5 discusses the results and provides suggestions for future work. Finally, Section 6 wraps up this paper with conclusions.
Section snippets
Assessing OpenStreetMap data quality
With the rapid growth of the OSM community and its applications, data quality has become a more crucial research topic than ever before, especially for authoritative consumers (e.g., humanitarian organizations and local governments) (Fonte et al., 2015). Quality measurements of spatial data usually follow the principles of the International Organization for Standardization (ISO) under ISO 19113 and ISO 19157, which consist of multiple elements such as accuracy, completeness, logical consistency, etc.
Exploring OpenStreetMap missing built-up areas
For large-scale exploration of OSM missing built-up areas, it would be too optimistic to rely on a single module, either hierarchical tweet clustering or fine-tuned building detection neural networks, since the two modules focus on different geographical scales and each has its limitations. On the one hand, through spatial clustering of geo-tagged tweets, we could define the human active regions (HAR) as those areas where Twitter users have clustered and posted a significant amount of
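The idea of deriving human active regions from tweet locations can be sketched as follows. The paper names HDBSCAN as its clustering method; the sketch below deliberately swaps in a much simpler stand-in (single-linkage grouping at one fixed radius, with small groups treated as noise), so it does not adapt to varying densities the way HDBSCAN does. The function names, the 5 km radius, the minimum cluster size, and the coordinates are all assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees

EARTH_RADIUS_KM = 6371.0


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def cluster_tweets(points: List[Point], eps_km: float = 5.0,
                   min_cluster_size: int = 3) -> List[int]:
    """Fixed-radius single-linkage grouping; a simplified stand-in, NOT HDBSCAN.

    Returns one label per point; -1 marks points in groups that are too small.
    """
    n = len(points)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of tweets closer than eps_km (O(n^2), fine for a sketch).
    for i in range(n):
        for j in range(i + 1, n):
            if haversine_km(points[i], points[j]) <= eps_km:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    sizes = Counter(roots)
    label_of_root: Dict[int, int] = {}
    labels: List[int] = []
    for r in roots:
        if sizes[r] < min_cluster_size:
            labels.append(-1)  # too few tweets: treated as noise
        else:
            labels.append(label_of_root.setdefault(r, len(label_of_root)))
    return labels


if __name__ == "__main__":
    # Illustrative coordinates only (two dense groups and one isolated point).
    tweets = [(-25.966, 32.573), (-25.970, 32.580), (-25.960, 32.570),
              (-19.843, 34.839), (-19.850, 34.845), (-19.840, 34.835),
              (-15.000, 39.000)]
    print(cluster_tweets(tweets))
```

A single fixed radius is the main limitation of this stand-in: regions with sparse but genuine activity are lost as noise, which is the kind of multi-density situation that motivates a hierarchical density-based method.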
Study areas and data description
The Republic of Mozambique, which was severely devastated by Cyclones Idai and Kenneth in 2019, is selected as our study area (Fig. 4) in this paper. This was the first time in recorded history that two strong tropical cyclones hit Mozambique in the same season. Nearly 2.2 million people in Mozambique were left in urgent need of humanitarian assistance, such as health care, nutrition, protection, and water and sanitation. Correspondingly, Humanitarian OpenStreetMap Team (HOT)
Discussions
Considering the HDBSCAN clustering results of geo-tagged tweets in Section 4.2, we believe that the desired clusters should have highly individual search radii, shapes, and densities, which reflects the heterogeneous level of socioeconomic development across Mozambique. This fact further distinguishes our work from existing social media data mining works, such as Steiger et al. (2016) and Liu et al. (2019), which mainly focus on individual human activity patterns in an urban city
Conclusions
In this paper, we presented a novel method for exploring OSM missing built-up areas from a joint perspective of social sensing and remote sensing. The proposed method consists of two core modules: identifying human active regions with geo-tagged tweets clustering, and mapping built-up areas by deep learning from existing OSM buildings and satellite imagery. To conclude the results answering RQ1, we demonstrated the capability of HDBSCAN in deriving multi-density tweets clusters with random
Declaration of Competing Interest
No potential conflict of interest was reported by the author.
Acknowledgements
The authors would like to take this opportunity to thank the editors and reviewers for their valuable comments and suggestions. This work has been partly supported by the Klaus Tschira Stiftung (KTS) Heidelberg.
References (55)
- et al. Extracting and understanding urban areas of interest using geotagged photos. Comput. Environ. Urban Syst. (2015)
- et al. Understanding human activity patterns based on space-time-semantics. ISPRS J. Photogramm. Remote Sens. (2016)
- et al. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recogn. (2019)
- et al. Correcting rural building annotations in OpenStreetMap using convolutional neural networks. ISPRS J. Photogramm. Remote Sens. (2019)
- et al. Semantic segmentation of slums in satellite images using transfer learning on fully convolutional neural networks. ISPRS J. Photogramm. Remote Sens. (2019)
- et al. A comprehensive framework for intrinsic OpenStreetMap quality analysis. Trans. GIS (2014)
- et al. Density-based clustering based on hierarchical density estimates.
- et al. Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data (2015)
- et al. The tasks of the crowd: A typology of tasks in geographic information crowdsourcing and a case study in humanitarian mapping. Remote Sens. (2016)
- Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A Large-Scale Hierarchical Image...
- Urban footprint processor—fully automated processing chain generating settlement masks from global data of the TanDEM-X mission. IEEE Geosci. Remote Sens. Lett.
- The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vision
- Quality assessment for building footprints data on OpenStreetMap. Int. J. Geograph. Informat. Sci.
- Usability of VGI for validation of land cover maps. Int. J. Geograph. Informat. Sci.
- Quality assessment of the French OpenStreetMap dataset. Trans. GIS
- Citizens as sensors: The world of volunteered geography. GeoJournal
- Crowdsourcing geographic information for disaster response: a research frontier. Int. J. Digital Earth
- How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets. Environ. Plann. B: Plann. Des.
- Measuring completeness of building footprints in OpenStreetMap over space and time. ISPRS Int. J. Geo-Informat.
- Mapping human settlements with higher accuracy and less volunteer efforts by combining crowdsourcing and deep learning. Remote Sensing
- Assessing completeness and spatial error of features in volunteered geographic information. ISPRS Int. J. Geo-Informat.
- Algorithms for Clustering Data.
- Learning aerial image segmentation from online maps. IEEE Trans. Geosci. Remote Sens.
Cited by (29)
InstantCITY: Synthesising morphologically accurate geospatial data for urban form analysis, transfer, and quality control
2023, ISPRS Journal of Photogrammetry and Remote Sensing
Citation excerpt: As the quality of urban data is becoming increasingly important (Basiri et al., 2019; Songchon et al., 2021; Grinberger et al., 2021), various methods to assess the completeness of features have been developed, with many of them focused on buildings (Senaratne et al., 2016). There are intrinsic methods, i.e. predicting the completeness of features based on the history of contributors or the arrangement of existing features (Zhou, 2017; Jacobs and Mitchell, 2020; Majic et al., 2021; Sundaram et al., 2021), and those that are extrinsic, requiring checking against another, usually authoritative, dataset representing the same features or proxies (Brovelli et al., 2016; Balducci, 2019; Li et al., 2020b). In the experiments, we investigate whether our method can also be used as a key component in spatial data quality assessment.
Global Building Morphology Indicators
2022, Computers, Environment and Urban Systems
Citation excerpt: Data on building heights that are fully complete are in some cases available from authoritative (government) datasets in form of building footprints enriched with attribute information on heights or as point clouds obtained from airborne lidar, but these are limited to few geographic areas. Despite commendable advancements in large-scale mapping of buildings using satellite remote sensing techniques, there are still no global open datasets on heights of individual buildings, and many instances are generated at a coarse spatial resolution (e.g. average building height at the scale of a block), limited in coverage, and/or their positional accuracy may not be fully adequate for studying the urban form at high resolution (Chen, Zhang, Wong, & Ignatius, 2020; Esch et al., 2022; Frantz et al., 2021; Geis et al., 2019; Li et al., 2020; Li, Herfort, Huang, Zia, & Zipf, 2020; Tian, Tsendbazar, van Leeuwen, Fensholt, & Herold, 2022; Zhu et al., 2022). This limitation solely pertains to our input dataset (OSM) and geographies with completeness issues.
Leveraging OpenStreetMap and Multimodal Remote Sensing Data with Joint Deep Learning for Wastewater Treatment Plants Detection
2022, International Journal of Applied Earth Observation and Geoinformation
High-resolution large-scale onshore wind energy assessments: A review of potential definitions, methodologies and future research needs
2022, Renewable Energy
Citation excerpt: For example, much more recently, Broveli and Zamboni [70] evaluated OSM building completeness in Lombardy Italy and found the dataset to be 57% complete. Li et al. [71] identified 13 missing built-up areas in Mozambique's OSM data with a new approach combining social and remote sensing, which achieved an overall accuracy of more than 90% showing room for improving OSM's completeness. Another promising dataset in this context is the World Settlement Footprint, which has global coverage at 10 m resolution and to our knowledge has not yet been employed for global onshore wind potential analyses [72].
Automatic mapping of national surface water with OpenStreetMap and Sentinel-2 MSI data using deep learning
2021, International Journal of Applied Earth Observation and Geoinformation
Citation excerpt: OSM can offer higher precision and more semantic information of water features than most RS products, although the volunteered geographical information (VGI) nature of OSM leads to inevitable concerns regarding data quality, position accuracy, spatial consistency, and data completeness (Goodchild and Li, 2012; Barron et al., 2014; Fan et al., 2014). Previous studies (Xu et al., 2019; Li et al., 2020) have successfully combined ML methods with OSM data to predict multi-lane roads in China and to identify missing built-up areas in Mozambique. Furthermore, Scholz et al. (2018), Chen et al. (2018), Schmitt (2020) highlighted the potential of harvesting OSM data for more effective and efficient training of ML-based LULC classification methods.
Detecting inconsistent information in crowd-sourced street networks based on parallel carriageways identification and the rule of symmetry
2021, ISPRS Journal of Photogrammetry and Remote Sensing
Citation excerpt: Moreover, the flexible way of semantic annotation encourages novel creation and use of the data (Ramm and Topf, 2010). Therefore, we claim that it is more sensible to keep the crowd-sourcing policy flexible while developing advanced methods to help identify problematic data as an assistant (e.g. Li et al., 2020). Such methods could be deployed in the crowd-sourcing platforms as data editor plug-ins during editing, or when the data are to be used in professional applications as a quality assurance mechanism.