High-Value Token-Blocking: Efficient Blocking Method for Record Linkage
ACM Transactions on Knowledge Discovery from Data (IF 4.0), Pub Date: 2021-07-21, DOI: 10.1145/3450527
Kevin O’hare 1 , Anna Jurek-Loughrey 1 , Cassio De Campos 2

Data integration is an important component of Big Data analytics. One of the key challenges in data integration is record linkage, that is, matching records that represent the same real-world entity. Because of computational costs, methods referred to as blocking are employed as a part of the record linkage pipeline in order to reduce the number of comparisons among records. In the past decade, a range of blocking techniques have been proposed. Real-world applications require approaches that can handle heterogeneous data sources and do not rely on labelled data. We propose high-value token-blocking (HVTB), a simple and efficient approach for blocking that is unsupervised and schema-agnostic, based on a crafted use of Term Frequency-Inverse Document Frequency. We compare HVTB with multiple methods and over a range of datasets, including a novel unstructured dataset composed of titles and abstracts of scientific papers. We thoroughly discuss results in terms of accuracy, use of computational resources, and different characteristics of datasets and records. The simplicity of HVTB yields fast computations and does not harm its accuracy when compared with existing approaches. It is shown to be significantly superior to other methods, suggesting that simpler methods for blocking should be considered before resorting to more sophisticated methods.
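
The abstract does not spell out the HVTB procedure, so the following is only a minimal sketch of the general idea it names: schema-agnostic token blocking in which TF-IDF scores decide which tokens of a record are used as blocking keys. It is not the authors' algorithm. The function names, the parameter `k` (number of top-scoring tokens kept per record), and the toy records are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(record):
    """Lowercase the record text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", record.lower())


def top_tokens(records, k=2):
    """Return, for each record, its k tokens with the highest TF-IDF score.

    Hypothetical helper: plain TF-IDF (relative term frequency times
    log(N / document frequency)), ties broken alphabetically.
    """
    docs = [Counter(tokenize(r)) for r in records]
    n = len(docs)
    # Document frequency: number of records in which each token occurs.
    df = Counter(tok for doc in docs for tok in doc)
    selected = []
    for doc in docs:
        total = sum(doc.values())
        scores = {
            tok: (count / total) * math.log(n / df[tok])
            for tok, count in doc.items()
        }
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected.append(ranked[:k])
    return selected


def candidate_pairs(records, k=2):
    """Group records that share a selected token and emit the pairs to compare."""
    blocks = defaultdict(set)
    for idx, tokens in enumerate(top_tokens(records, k)):
        for tok in tokens:
            blocks[tok].add(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


if __name__ == "__main__":
    # Toy records (assumed data): 0/1 and 2/3 are intended duplicates.
    records = [
        "john smith belfast",
        "jon smith belfast",
        "mary jones utrecht",
        "mary jones utrech",
    ]
    print(sorted(candidate_pairs(records, k=2)))  # prints [(0, 1), (2, 3)]
```

In this toy run only 2 of the 6 possible record pairs are kept for comparison, which is the purpose of blocking. The sketch treats each record as a bag of tokens and ignores attribute names, which is what makes it schema-agnostic, and it needs no labelled pairs.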

Updated: 2021-07-21