VIGOR: Cross-View Image Geo-localization beyond One-to-one Retrieval
arXiv - CS - Artificial Intelligence. Pub Date: 2020-11-24, DOI: arXiv:2011.12172
Sijie Zhu, Taojiannan Yang, Chen Chen

Cross-view image geo-localization aims to determine the location of a street-view query image by matching it against GPS-tagged reference images captured from an aerial view. Recent works have achieved surprisingly high retrieval accuracy on city-scale datasets. However, these results rely on the assumption that there exists a reference image exactly centered at the location of any query image, which does not hold in practical scenarios. In this paper, we redefine the problem with a more realistic assumption: the query image can be taken anywhere in the area of interest, and the reference images are captured before the queries emerge. This assumption breaks the one-to-one retrieval setting of existing datasets, as queries and reference images are no longer perfectly aligned pairs and multiple reference images may cover one query location. To bridge the gap between this realistic setting and existing datasets, we propose a new large-scale benchmark -- VIGOR -- for cross-View Image Geo-localization beyond One-to-one Retrieval. We benchmark existing state-of-the-art methods and propose a novel end-to-end framework that localizes the query in a coarse-to-fine manner. Beyond image-level retrieval accuracy, we also evaluate localization accuracy in terms of actual distance (in meters) using the raw GPS data. Extensive experiments under different application scenarios validate the effectiveness of the proposed method. The results indicate that cross-view geo-localization in this realistic setting remains challenging, motivating new research in this direction. Our dataset and code will be publicly available.
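The abstract mentions two kinds of evaluation: image-level retrieval accuracy under a setting where several aerial references may cover one query, and localization error measured in meters from raw GPS coordinates. The sketch below (not the authors' released code) illustrates how such metrics could be computed in principle: nearest-neighbor search over descriptor embeddings for retrieval, and a haversine distance between the retrieved reference's GPS tag and the query's GPS tag for meter-level error. The array shapes, the top-k parameter, and the per-query "covering set" of valid references are illustrative assumptions.

```python
# Minimal evaluation sketch under the assumptions stated above (not the VIGOR codebase).
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def evaluate(query_emb, ref_emb, query_gps, ref_gps, positive_sets, k=1):
    """
    query_emb:     (Nq, D) L2-normalized street-view descriptors
    ref_emb:       (Nr, D) L2-normalized aerial-view descriptors
    query_gps:     (Nq, 2) raw (lat, lon) of each query
    ref_gps:       (Nr, 2) (lat, lon) of each aerial reference center
    positive_sets: list of sets; indices of references whose coverage contains
                   query i (possibly more than one in the beyond one-to-one setting)
    Returns (top-k retrieval accuracy, mean localization error in meters).
    """
    sims = query_emb @ ref_emb.T                 # cosine similarity matrix (Nq, Nr)
    topk = np.argsort(-sims, axis=1)[:, :k]      # indices of top-k references per query

    hits, errors_m = 0, []
    for i, ranked in enumerate(topk):
        # Retrieval counts as correct if any covering reference appears in the top-k.
        if positive_sets[i] & set(ranked.tolist()):
            hits += 1
        # Coarse localization: use the top-1 reference center as the predicted position.
        best = ranked[0]
        errors_m.append(haversine_m(*query_gps[i], *ref_gps[best]))

    return hits / len(topk), float(np.mean(errors_m))
```

In this sketch the predicted position is simply the center of the top-1 aerial reference; a coarse-to-fine framework like the one described in the abstract would further refine the location within that reference tile before measuring the meter-level error.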

Updated: 2020-11-25