Scalable Feature Matching Across Large Data Collections
Journal of Computational and Graphical Statistics (IF 1.4), Pub Date: 2022-06-02, DOI: 10.1080/10618600.2022.2074429
David Degras

Abstract

This article is concerned with matching feature vectors in a one-to-one fashion across large collections of datasets. Formulating this task as a multidimensional assignment problem with decomposable costs (MDADC), we develop fast algorithms with time complexity roughly linear in the number n of datasets and space complexity a small fraction of the data size. These remarkable properties hinge on using the squared Euclidean distance as the dissimilarity function, which can reduce the n(n-1)/2 matching problems between pairs of datasets to n problems and enable calculating assignment costs on the fly. To our knowledge, no other method applicable to the MDADC possesses these linear scaling and low-storage properties necessary for large-scale applications. In numerical experiments, the novel algorithms outperform competing methods and show excellent computational and optimization performance. An application of feature matching to a large neuroimaging database is presented. The algorithms of this article are implemented in the R package matchFeat available at github.com/ddegras/matchFeat. Supplementary materials for this article are available online.
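The reduction from n(n-1)/2 pairwise problems to n problems plausibly rests on a standard identity for the squared Euclidean distance: for n vectors matched together, the sum of all pairwise squared distances equals n times the sum of squared distances to their mean. The Python sketch below only checks that identity numerically. It is not taken from the matchFeat package, and the function names (`pairwise_cost`, `centroid_cost`) are hypothetical, chosen for illustration.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pairwise_cost(vectors):
    """Sum of squared distances over all n(n-1)/2 unordered pairs."""
    n = len(vectors)
    return sum(
        sq_dist(vectors[i], vectors[j])
        for i in range(n)
        for j in range(i + 1, n)
    )


def centroid_cost(vectors):
    """n times the sum of squared distances to the mean vector.

    Only n distance evaluations are needed, instead of n(n-1)/2.
    """
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    return n * sum(sq_dist(v, mean) for v in vectors)


# One group of matched feature vectors, one vector per dataset
# (assumed toy data: 6 datasets, 3-dimensional features).
rng = random.Random(0)
matched = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(6)]

# The two costs agree up to floating-point rounding.
print(pairwise_cost(matched), centroid_cost(matched))
```

Because the cost of a group depends on its members only through their distances to the group mean, each dataset can be compared against a running mean rather than against every other dataset, which is consistent with the linear scaling in n and the on-the-fly cost computation described in the abstract.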




Updated: 2022-06-02