Machine learning for cross-gazetteer matching of natural features, International Journal of Geographical Information Science


Machine learning for cross-gazetteer matching of natural features
International Journal of Geographical Information Science (IF 4.3), Pub Date: 2019-04-22, DOI: 10.1080/13658816.2019.1599123
Elise Acheson₁, Michele Volpi₂, Ross S. Purves₁

ABSTRACT Defining and identifying duplicate records in a dataset is a challenging task which grows more complex when the modeled entities themselves are hard to delineate. In the geospatial domain, it may not be clear where a mountain, stream, or valley ends and begins, a problem carried over when such entities are catalogued in gazetteers. In this paper, we take two gazetteers, GeoNames and SwissNames3D, and perform matching – identifying records in each that are about the same entity – across a sample of natural feature records. We first perform rule-based matching, establishing competitive results, then apply machine learning using Random Forests, a method well-suited to the matching task. We report on the performance of a wider array of matching features than has been previously studied, including domain-specific ones such as feature type, land cover class, and elevation. Our results show an increase in performance using machine learning over rules, with a notable performance gain from considering feature types, but negligible gains from other specialized matching features. We argue that future work in this area should strive to be more reproducible and report results on a realistic testing pipeline including candidate selection, feature extraction, and classification.
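
The pipeline named in the abstract (candidate selection, feature extraction, classification) can be illustrated with a minimal sketch of the rule-based stage. This is not the authors' implementation. The records, field names, thresholds, and function names below are all assumptions made for illustration, and the coordinates are approximate toy values rather than actual GeoNames or SwissNames3D entries.

```python
import math
from difflib import SequenceMatcher

# Hypothetical toy records; field names are assumptions, not the gazetteers' real schemas.
GEONAMES = [
    {"id": "g1", "name": "Matterhorn", "lat": 45.9764, "lon": 7.6586, "type": "mountain"},
    {"id": "g2", "name": "Aare", "lat": 46.9480, "lon": 7.4474, "type": "stream"},
]
SWISSNAMES = [
    {"id": "s1", "name": "Matterhorn", "lat": 45.9766, "lon": 7.6585, "type": "mountain"},
    {"id": "s2", "name": "Mattertal", "lat": 46.10, "lon": 7.78, "type": "valley"},
    {"id": "s3", "name": "Aare", "lat": 46.9500, "lon": 7.4500, "type": "stream"},
]


def haversine_km(a, b):
    """Great-circle distance between two records, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a["lat"], a["lon"], b["lat"], b["lon"]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def select_candidates(record, others, radius_km=25.0):
    """Candidate selection: keep only records within an assumed search radius."""
    return [o for o in others if haversine_km(record, o) <= radius_km]


def extract_features(a, b):
    """Feature extraction: name similarity, distance, and feature-type agreement."""
    return {
        "name_sim": SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio(),
        "dist_km": haversine_km(a, b),
        "same_type": a["type"] == b["type"],
    }


def rule_match(features, min_name_sim=0.9, max_dist_km=5.0):
    """Classification by hand-written rule; the thresholds are illustrative assumptions."""
    return (features["name_sim"] >= min_name_sim
            and features["dist_km"] <= max_dist_km
            and features["same_type"])


def match_all(source, target):
    """Run the three stages over every source record and collect matched id pairs."""
    matches = []
    for a in source:
        for b in select_candidates(a, target):
            if rule_match(extract_features(a, b)):
                matches.append((a["id"], b["id"]))
    return matches


if __name__ == "__main__":
    print(match_all(GEONAMES, SWISSNAMES))
```

In a machine-learning variant, the dictionaries returned by `extract_features` would be turned into numeric vectors and passed, together with labelled match and non-match pairs, to a Random Forest classifier (for example scikit-learn's `RandomForestClassifier`) in place of `rule_match`. Further features mentioned in the abstract, such as land cover class and elevation, would be added at the same step.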

Updated: 2019-04-22
