Detecting inconsistent information in crowd-sourced street networks based on parallel carriageway identification and the rule of symmetry

https://doi.org/10.1016/j.isprsjprs.2021.03.014

Abstract

Crowd-sourced geographic information has great potential in scientific and public domains, and is increasingly considered by geospatial professionals as an alternative to traditional spatial data collection. This success, however, implies long-term reliance on crowd-sourcing projects and raises growing concern over the quality of the constantly evolving data. Our general aim is to develop an approach that uses geographic rules to identify inconsistent information in street networks without relying on external sources. This paper focuses on a more challenging sub-process: identifying inconsistent information using the rule of symmetry, i.e. the observation that information (e.g. name, class, speed limit) in parallel carriageways (e.g. divided highways) always constrains each other. The process starts by clustering related streets into well-defined or ambiguous situations using a DBSCAN-inspired technique; two pairing strategies are then designed, one for each situation. To address the challenging problem of pairing carriageways in ambiguous situations, three pairing algorithms (stroke-based, tree-based, and mixed) are devised around the idea of an expanded 'receptive field' that disentangles the ambiguities; each focuses on efficiency, effectiveness, or their tradeoff. Evaluation on 7 selected datasets shows that all three algorithms reach satisfactory performance (F1-score > 92%) in ambiguous situations, and much higher accuracy over the datasets as a whole. We then applied our approach to over 40 datasets worldwide and detected inconsistencies (i.e. dissimilar values in paired carriageways) in crowd-sourced and authoritative street networks. We evaluate the identified inconsistencies, analyze the potential of our approach for suggesting corrections to problematic data, and discuss its effectiveness, remaining issues, and future directions. We thereby demonstrate that the proposed approach is effective for quality assurance, and can be used to assure the quality of crowd-sourced and authoritative mapping projects as they evolve, without relying on ground truth.

Introduction

In recent years, scientific and public domains have witnessed the proliferation of crowd-sourced geographic information (Goodchild, 2007, See et al., 2016, Yan et al., 2020). Such emerging data sources have proven successful in collecting timely and diverse geographic information (Goodchild and Li, 2012) and have shown great potential in many domains such as natural resource management, urban planning, and disaster response (Schultz et al., 2017, Spyratos and Stathakis, 2018, Feng et al., 2020). A critical problem, however, is data quality. Drawing on Linus's law from the open-source domain, the crowd-sourced mapping community believes that quality will eventually be improved by the crowd (Haklay et al., 2010, Foody et al., 2015). But this is not guaranteed. For example, some mistakes may never be corrected if no one pays attention to them (Goodchild and Li, 2012), suggesting that such a quality control mechanism can be inefficient. Hence, more effective and efficient methods for quality assurance are needed.

After years of development, the quality of some crowd-sourced data is approaching that of authoritative datasets, at least in many urban areas (Haklay, 2010, Zielstra and Zipf, 2010, Basiri et al., 2019). As a result, GIS practitioners and data providers have started to examine the feasibility of incorporating such data into professional domains (Parker et al., 2012, Mooney and Morley, 2014, Olteanu-Raimond et al., 2016). For example, national mapping agencies in European countries have begun to study the use of such information in, e.g., change detection to accelerate their production lines (Olteanu-Raimond et al., 2016). North American countries have also reported progress on updating authoritative data based on crowd-sourced data (Elwood et al., 2012, Begin, 2014). However, as professional and crowd-sourcing communities take different quality control approaches, the debate as to whether and how to integrate crowd-sourced data into production lines has never ceased (Elwood et al., 2012, Zhang et al., 2018b).

Therefore, the central concern around crowd-sourced geo-information seems to be moving from how good or useful it is in various applications (Haklay, 2010; Basiri et al., 2019) towards whether we can trust and depend on it in the long run (an evolving perspective), and whether there are approaches to assuring the quality of the constantly evolving data more efficiently. Quality assessment of crowd-sourced geo-data has been studied extensively, often aiming to derive a general description of data quality by comparing it with reference data (Haklay, 2010, Zielstra and Zipf, 2010, Fan et al., 2014). Although such a static and descriptive approach is useful, it does not provide insights into where and how the data can be improved during the evolution of crowd-sourcing projects. Further, the required reference data make such assessments costly and difficult to scale. We therefore explore means to identify and improve problematic data efficiently and economically, which is crucial for crowd-sourcing platforms to remain successful.

Besides its current quality status, two further problems hold the emerging data back from wider and more serious applications. First, there is a lack of specifications for the data (Brando and Bucher, 2010). Contributors of different backgrounds are free to model geographic entities and annotate them in their own ways, bringing about data inconsistencies. For example, a dual carriageway can be represented by a single-line or double-line geometry, and both representations may coexist, probably because multiple users have worked on different parts of the same road; likewise, conflicting annotations may be attached to different parts of an object, leading to mistakes in data use (Zhang and Ai, 2015).

Second, malicious edits, or vandalism, are another source that undermines the credibility of crowd-sourced data and peer production today (Ballatore, 2014). The fact that the data are so open to unrestricted user edits explains traditional data providers' reservations about crowd-sourced data (TomTom, 2012). For example, the direction and annotations of crowd-sourced street networks can easily be altered, leading to unexpected consequences in routing and navigation.

Nevertheless, it is widely perceived in the crowd-sourcing community that the less restrictive data policy is precisely what welcomes newcomers to join and make sustained contributions, and is hence key to the success of such projects (Ramm and Topf, 2010, Halfaker et al., 2013). Moreover, the flexible way of semantic annotation encourages novel creation and use of the data (Ramm and Topf, 2010). Therefore, we argue that it is more sensible to keep the crowd-sourcing policy flexible while developing advanced methods that assist in identifying problematic data (e.g. Li et al., 2020). Such methods could be deployed in crowd-sourcing platforms as data-editor plug-ins during editing, or as a quality assurance mechanism when the data are to be used in professional applications.

In this paper, we focus on street networks – fundamental datasets in many applications such as transportation, human mobility studies, daily navigation, and location-based services (Jiang et al., 2009, Attard et al., 2016, Yan et al., 2020). Crowd-sourced street networks, however, constantly suffer from the above-mentioned issues, which cause the data to become inconsistent and create difficulties for downstream applications.

Our general approach uses geographic rules in network space (i.e. the rules of continuity and symmetry) as constraints to automatically identify inconsistent data. The rule of continuity is relatively easy to apply and has been addressed in Zhang and Ai (2015). In this paper, we focus on the other, more challenging sub-process of the general approach, which uses the rule of symmetry: information of parallel carriageways (e.g. divided highways) always constrains each other (Sect. 1.2). The general approach is a geographic approach to quality assurance (Goodchild and Li, 2012), able to identify counterfactual information with geographic knowledge and without referring to ground truth or reference data. In addition, it has the potential to suggest corrections to inconsistent data according to spatial context. Although the proposed inconsistency detection is motivated by crowd-sourced data, it is also applicable to authoritative and commercial datasets. Before proceeding to technical details, we first review the relevant literature and the concepts behind the geographic rules used, and analyze the difficulties of identifying inconsistent street information in complex situations.

In general, quality analysis can be performed for at least two purposes: quality assessment or quality assurance. The former is a descriptive approach that aims to derive a general description or summary of the data quality (e.g. good, bad, or comparable), while in the latter, (semi-)automated methods are developed to help identify substandard data and improve the quality.

Further, quality analysis can be divided into extrinsic and intrinsic approaches (Fig. 1). In extrinsic assessments, the target dataset is compared with authoritative or third-party reference data, ground truth, or local knowledge. For example, in early studies of the OpenStreetMap (OSM) project, similarities and differences between OSM and professional datasets were analyzed at the subdivision (grid) level (Haklay et al., 2010, Zielstra and Zipf, 2010, Fan et al., 2014) and at the feature level (Girres and Touya, 2010). The grid-level analysis is effective at presenting data quality for relatively large areas (a descriptive approach). Although the feature-level evaluation can provide fine-grained information as to which feature is problematic, it requires sophisticated data matching to support the analysis between heterogeneous datasets (Koukoletsos et al., 2012, Zhang et al., 2014).

The above extrinsic quality assessments show that the completeness and accuracy of OSM are increasingly comparable to professional data (Olteanu-Raimond et al., 2016). However, due to the limited accessibility of reference data, such methods can hardly be carried out worldwide. Besides, high completeness and accuracy do not guarantee the usefulness of data in practice (Over et al., 2010). For example, even if a street network covers a whole area, the topological connectivity crucial for routing is not always satisfactory (Mondzech and Sester, 2011, Zhang and Ai, 2015). Hence, for practical purposes, meeting application requirements seems more important.

Intrinsic approaches, on the contrary, are much easier to implement for large-scale quality analysis, as no reference or ground truth is required. Intrinsic assessments can be either descriptive or instructive. Mooney et al. (2010), for instance, used spatial sampling and annotation rate to describe the general properties of OSM. By comparing intrinsic quality measures with some specifications (Fig. 1), the analysis becomes instructive for quality improvement. Despite the lack of systematic specifications in crowd-sourced geographic information, partial specifications can still be exploited: for example, ramps and roundabouts need to be one-way. Many such rules are integrated in existing OSM quality assurance tools to identify inconsistent data, as sketched below.
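For illustration, the following minimal Python sketch shows what such a rule check could look like. The tag keys follow OSM conventions, but the rule set, thresholds, and function names are our own assumptions, not the behavior of any existing OSM tool.

```python
# Minimal sketch of rule-based tag checking (illustrative only):
# tag keys follow OSM conventions, but the rule set is our own.

def check_way(tags):
    """Return rule violations for one street ('way') given its tag dict."""
    issues = []
    # junction=roundabout implies one-way traffic; an explicit oneway=no
    # contradicts that implication.
    if tags.get("junction") == "roundabout" and tags.get("oneway") == "no":
        issues.append("roundabout tagged as two-way")
    # Motorway ramps are normally one-way; flag explicit two-way ramps.
    if tags.get("highway") == "motorway_link" and tags.get("oneway") == "no":
        issues.append("two-way motorway ramp")
    return issues

print(check_way({"junction": "roundabout", "oneway": "no"}))
# -> ['roundabout tagged as two-way']
```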

For quality assurance, Goodchild and Li (2012) distinguished between crowd-sourcing, social, and geographic approaches. The crowd-sourcing approach, which relies on Linus's Law, is the default approach of many collaborative projects, where contributors report or fix bugs found by themselves or others. Yet its effectiveness depends on the number, experience, and vibrancy of the users in an area. The social approach, likewise, relies on a hierarchy of reviewers with different roles to screen the uploaded data, and is also adopted by collaborative mapping projects such as OSM and Wikimapia. Both approaches ultimately rely on users with local knowledge to improve data quality.

More recently, crowd-sourced imaging platforms (e.g. Mapillary) have emerged. Although deep learning has been adopted to recognize street-level objects from the images, human effort is still needed to incorporate the recognized information into other crowd-sourced mapping projects for quality assurance (Juhász and Hochmair, 2017). Hence, this is essentially a crowd-sourcing approach relying on external sources, and the whole process is less efficient, though it can be made faster (e.g. Vargas Muñoz et al., 2020).

The geographic approach, on the other hand, compares facts, relations, and patterns with geographic laws or knowledge, and aims to identify situations that contradict or are inconsistent with that knowledge. By exploiting structures within the physical or social world (e.g. correlations, associations, autocorrelations), one can develop automated methods for quality assurance without referring to ground truth or third-party data (Fig. 1). For example, fractal laws, Horton's law, and Central Place Theory have been tested on rivers, river systems, shorelines, and settlements (Goodchild and Li, 2012). Although these statistical, aggregate laws are difficult to formalize into rules at the individual level, the idea of testing the adherence to, or violation of, specific structures or laws within data offers a new way of examining and improving data quality (Zhang et al., 2018a).

For example, the ideas of correlations and associations have been shown to be effective in identifying inconsistencies in levels of detail, as well as relational and semantic inconsistencies (Touya and Brando, 2013, Ai et al., 2014, Zhang and Ai, 2015). As a social counterpart, correlations between user editing activity and data quality have also been examined using behavioral/historical data (Fig. 1). One example is that editing behavior patterns can be exploited to detect vandalism (Neis et al., 2012). Likewise, researchers have studied the number of edits in relation to data quality and found that quality appears to be proportional to the level of editing activity, up to a ceiling (Haklay et al., 2010, Mooney and Corcoran, 2012, Severinsen et al., 2019). They therefore suggested using the level of user activity in an area as an indicator of its data quality. Like the descriptive approach, however, such an indicator may only reflect the general quality of an area, and is of limited help for identifying and improving problematic data.

Here, we introduce two geographic rules that can be used to identify inconsistent information in street networks. The first is the rule of continuity, which states that information (i.e. attribute values) is highly positively autocorrelated along smoothly connected streets, called natural streets (Jiang et al., 2008). This means that two segments close to each other in a natural street have similar properties, so a segment can be checked for inconsistency against its neighbors within the scope of a natural street (Zhang and Ai, 2015); this is analogous to Tobler's First Law (TFL) of geography in network space. The second is the rule of symmetry, meaning that information is highly mirrored (i.e. same or similar attribute values, opposite traffic directions) in dual carriageways. Zhang et al. (2018a) show that both rules can be observed worldwide and are statistically significant, allowing them to be formalized for quality assurance. A minimal sketch of a continuity check follows below.
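As an illustration of the continuity rule (our simplification, not the actual algorithm of Zhang and Ai, 2015), the sketch below flags a segment whose attribute value deviates from both of its neighbors inside an otherwise homogeneous natural street:

```python
# Sketch of a continuity check along one natural street: a segment whose
# value differs from both neighbours, while the neighbours agree, is an
# isolated deviation and hence suspicious. Simplified for illustration.

def continuity_outliers(values):
    """values: attribute values (e.g. speed limits) of consecutive
    segments along a natural street; returns indices of suspects."""
    flagged = []
    for i in range(1, len(values) - 1):
        isolated = (values[i] != values[i - 1]
                    and values[i] != values[i + 1]
                    and values[i - 1] == values[i + 1])
        if isolated:
            flagged.append(i)
    return flagged

print(continuity_outliers([50, 50, 30, 50, 50]))  # -> [2]
```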

The sub-process using the rule of continuity, which works on the entire network, is relatively straightforward and has been addressed in Zhang and Ai (2015). This paper focuses on the more challenging sub-process: using the rule of symmetry to detect inconsistencies in parallel carriageways (a sketch of the comparison itself is given after this paragraph). To examine whether the opposite carriageways of a divided highway carry consistent information, one first needs to identify the pairs of parallel streets that form the same highway; such pairing information is not explicitly encoded. The pairing process is challenging (Fig. 2a), partly due to the difficulty of determining the parallel relation in real data. Moreover, when multiple streets run in parallel, the pairing can be highly ambiguous (inset in Fig. 2a). Depending on how multiple streets converge and diverge at the two ends of an ambiguous region, we identify 6 basic types of ambiguous configurations (Fig. 2b) for evaluation purposes (Sect. 3.2). In general, ambiguous situations are harder to pair than well-defined ones (arrows in Fig. 2a). We therefore propose a pairing procedure that exploits spatial relations in street networks to recognize divided highways in ambiguous situations. Note that, although the sub-process addressed in this paper works on parts of the street network, the general inconsistency detection process works on the entire network.
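Once a carriageway pair has been identified, the symmetry comparison itself is simple. Below is a hedged sketch: the mirrored attribute keys are OSM-style, and their selection is our assumption rather than the paper's exact rule set.

```python
# Sketch of the symmetry check on one identified carriageway pair:
# paired carriageways should share values for the mirrored attributes.
# The keys below are OSM-style; the selection is an assumption.

MIRRORED_KEYS = ("name", "highway", "maxspeed")

def symmetry_inconsistencies(tags_a, tags_b):
    """Return (key, value_a, value_b) triples where paired carriageways
    carry dissimilar values; missing values are not flagged."""
    issues = []
    for key in MIRRORED_KEYS:
        va, vb = tags_a.get(key), tags_b.get(key)
        if va is not None and vb is not None and va != vb:
            issues.append((key, va, vb))
    return issues

print(symmetry_inconsistencies(
    {"name": "High St", "highway": "primary", "maxspeed": "50"},
    {"name": "High St", "highway": "secondary", "maxspeed": "50"}))
# -> [('highway', 'primary', 'secondary')]
```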

We use OSM data as running examples to explain the steps of our methodology. The major problem in the pairing is that algorithms normally work with a limited 'receptive field', a term borrowed from deep learning to refer to the spatial scale of analysis (Araujo et al., 2019). This restricts an algorithm's ability to incorporate non-local information and to disentangle the ambiguities. The basic idea here is to expand this 'receptive field' along the network, so that more evidence can be accumulated to improve pairing performance (Sect. 2); a minimal sketch of such an expansion is shown below. Specifically, three pairing algorithms are devised. We evaluate the performance of the algorithms and the feasibility of our approach in detecting inconsistencies in street networks (Sect. 3). We then discuss the general applicability, remaining issues, and options for enhancement in Sect. 4, and conclude in Sect. 5.
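The expansion can be pictured as a breadth-first traversal that grows a k-hop neighborhood around a seed segment. The sketch below is our illustration of this idea only; the paper's stroke-based, tree-based, and mixed algorithms expand the field in more informed ways.

```python
from collections import deque

# Sketch of expanding a 'receptive field' along the network: collect all
# segments within k hops of a seed, so that non-local evidence can inform
# the pairing decision. Illustrative only.

def receptive_field(adjacency, seed, k):
    """adjacency: dict mapping a segment id to the ids of segments it
    touches; returns all segment ids within k hops of the seed."""
    field = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        seg, depth = frontier.popleft()
        if depth == k:
            continue
        for nxt in adjacency.get(seg, ()):
            if nxt not in field:
                field.add(nxt)
                frontier.append((nxt, depth + 1))
    return field

print(receptive_field({1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}, 1, 2))
# -> {1, 2, 3}
```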

Section snippets

Observed principles and the general framework

To identify parallel streets such as divided highways in a street network, one generally looks for street segments that run in parallel and are sufficiently close to each other (Zhang et al., 2018a); a minimal sketch of this criterion follows this excerpt. Such a local strategy is useful for recognizing well-defined divided highways (e.g. Fig. 2a). However, closeness and degree of parallelism are not sufficient for pairing in ambiguous situations. For instance, parallel streets that are closer to each other may be parts of different highways (e.g. four…
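The following sketch illustrates the local closeness-and-parallelism criterion mentioned above; the angle and distance thresholds are our illustrative assumptions, not the paper's calibrated values.

```python
import math

# Sketch of the local pairing criterion: two segments are candidate
# carriageway partners if they are near-parallel and close together.
# Thresholds are illustrative, not the paper's calibrated values.

def heading(p, q):
    """Orientation of segment p->q in degrees, folded into [0, 180)."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 180.0

def is_candidate_pair(seg_a, seg_b, max_angle=10.0, max_dist=40.0):
    """seg_*: ((x1, y1), (x2, y2)) in metres; True if near-parallel
    and the segment midpoints are within max_dist of each other."""
    d = abs(heading(*seg_a) - heading(*seg_b))
    angle = min(d, 180.0 - d)  # smallest angle between the two headings
    mid_a = ((seg_a[0][0] + seg_a[1][0]) / 2, (seg_a[0][1] + seg_a[1][1]) / 2)
    mid_b = ((seg_b[0][0] + seg_b[1][0]) / 2, (seg_b[0][1] + seg_b[1][1]) / 2)
    return angle <= max_angle and math.dist(mid_a, mid_b) <= max_dist

print(is_candidate_pair(((0, 0), (100, 0)), ((0, 15), (100, 16))))  # True
```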

Experimental design

We implemented the proposed carriageway pairing algorithms and the inconsistency detection as an ArcGIS plug-in using C#. For their evaluation, we visually compared the automatically identified divided highways with web maps such as ESRI World Topo and Google Maps. Because these map services are highly detailed and display street names, such a visual comparison is reliable and leaves little room for subjective interpretation. Seven datasets were used (Table 2) to evaluate our pairing algorithms…
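For reference, the performance figures reported (e.g. the F1-scores quoted in the abstract) follow the standard definitions over identified versus visually verified pairs; a trivial sketch, with hypothetical counts:

```python
# Standard precision/recall/F1 over the identified carriageway pairs,
# counted against visually verified reference pairs (tp: correct pairs,
# fp: spurious pairs, fn: missed pairs). Counts below are hypothetical.

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(tp=95, fp=4, fn=3), 3))  # -> 0.964
```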

The effectiveness and general applicability of our approach

The core of the pairing algorithms lies in the expansion of the 'receptive field', which lets them integrate larger spatial contexts to help disentangle ambiguous pairing relations. First, our results confirm the effectiveness of this idea: with only measures of closeness and parallelism, the pairing in ambiguous situations can hardly be addressed (e.g. Zhang et al., 2018a). Second, the symmetrical pairing used to refine the pairing of type 1 configurations (Sect. 2.5.4) proved useful. Without…

Conclusions

Using the proposed carriageway pairing methods, together with the validated principle of symmetry, we show that it is feasible to automatically detect inconsistencies in street networks that are otherwise invisible, and costly and extremely inefficient for humans to find by visual inspection. Since our approach does not rely on ground truth or third-party data, it is more affordable than other quality assurance approaches. This is demonstrated by the inconsistency detection in >40 cities…

Declaration of Competing Interest

The authors declare that there is no conflict of interest.

References (48)

  • Ballatore, A. (2014). Defacing the map: Cartographic vandalism in the digital commons. The Cartographic Journal.
  • Basiri, A., et al. (2019). Crowdsourced geospatial data quality: Challenges and future directions. International Journal of Geographical Information Science.
  • Begin, D. (2014). Towards integrating VGI and national mapping agency operations: A Canadian case study. In: Proceedings...
  • Brando, C., et al. (2010). Quality in user generated spatial content: A matter of specifications. In: Proceedings of the 13th AGILE International Conference on Geographic Information Science.
  • Elwood, S., et al. (2012). Researching volunteered geographic information: Spatial data, geographic research, and new social practice. Annals of the Association of American Geographers.
  • Ester, M., et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
  • Fan, H., et al. (2014). Quality assessment for building footprints data on OpenStreetMap. International Journal of Geographical Information Science.
  • Foody, G.M., et al. (2015). Accurate attribute mapping from volunteered geographic information: Issues of volunteer quantity and quality. The Cartographic Journal.
  • Girres, J.F., et al. (2010). Quality assessment of the French OpenStreetMap dataset. Transactions in GIS.
  • Goodchild, M.F. (2007). Citizens as sensors: The world of volunteered geography. GeoJournal.
  • Haklay, M., et al. (2010). How many volunteers does it take to map an area well? The validity of Linus's Law to volunteered geographic information. The Cartographic Journal.
  • Haklay, M. (2010). How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets. Environment and Planning B: Planning and Design.
  • Halfaker, A., et al. (2013). The rise and decline of an open collaboration system: How Wikipedia's reaction to popularity is causing its decline. American Behavioral Scientist.
  • Jiang, B., et al. (2008). Self-organized natural roads for predicting traffic flow: A sensitivity study. Journal of Statistical Mechanics: Theory and Experiment.