Large Scale Record Linkage in the Presence of Missing Data


arXiv - CS - Databases, Pub Date: 2021-04-19, DOI: arxiv-2104.09677
Thilina Ranbaduge, Peter Christen, Rainer Schnell

Record linkage aims to identify, accurately and efficiently, records that represent the same entity within or across disparate databases. It is a fundamental task in data integration and is increasingly required for accurate decision making in application domains ranging from health analytics to national security. Traditional record linkage techniques calculate string similarities between quasi-identifying (QID) values, such as the names and addresses of people. Errors, variations, and missing QID values can, however, lead to low linkage quality because the similarities between records cannot be calculated accurately. To overcome this challenge, we propose a novel technique that can accurately link records even when QID values contain errors or variations, or are missing. We first generate attribute signatures (concatenated QID values) using an Apriori-based selection of suitable QID attributes, and then relational signatures that encapsulate relationship information between records. Combined, these signatures can uniquely identify individual records and facilitate fast, high-quality linking of very large databases through accurate similarity calculations between records. We evaluate the linkage quality and scalability of our approach using large real-world databases, showing that it can achieve high linkage quality even when the databases being linked contain substantial amounts of missing values and errors.
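The idea of attribute signatures can be illustrated with a minimal sketch. It is not the authors' implementation. The attribute combinations in `COMBOS` are hand-picked assumptions standing in for the Apriori-based selection, the function names and record fields are hypothetical, matching is exact after lowercasing (so it does not tolerate typing errors), and relational signatures are omitted entirely. What it does show is why several signatures per record help with missing data: a record that lacks one QID value can still be linked through a combination that does not use that value.

```python
# Hypothetical attribute combinations. The paper selects these with an
# Apriori-based procedure, which is NOT reproduced here.
COMBOS = [
    ("first_name", "last_name"),
    ("last_name", "postcode"),
    ("first_name", "postcode"),
]


def attribute_signatures(record, combos=COMBOS):
    """Return one concatenated signature per combination whose values are all present."""
    signatures = set()
    for combo in combos:
        values = [record.get(attr) for attr in combo]
        # Skip any combination that touches a missing or blank value.
        if any(v is None or not str(v).strip() for v in values):
            continue
        parts = [f"{attr}={str(v).strip().lower()}" for attr, v in zip(combo, values)]
        signatures.add("|".join(parts))
    return signatures


def candidate_pairs(db_a, db_b):
    """Pair up records from two databases that share at least one signature."""
    # Inverted index: signature -> record ids in db_b that produce it.
    index = {}
    for id_b, record in db_b.items():
        for sig in attribute_signatures(record):
            index.setdefault(sig, set()).add(id_b)

    pairs = set()
    for id_a, record in db_a.items():
        for sig in attribute_signatures(record):
            for id_b in index.get(sig, ()):
                pairs.add((id_a, id_b))
    return pairs


if __name__ == "__main__":
    db_a = {
        "a1": {"first_name": "Anna", "last_name": "Smith", "postcode": None},
        "a2": {"first_name": "Bob", "last_name": "Jones", "postcode": "2600"},
    }
    db_b = {
        "b1": {"first_name": "anna", "last_name": "Smith", "postcode": "2601"},
        "b2": {"first_name": None, "last_name": "Jones", "postcode": "2600"},
        "b3": {"first_name": "Carl", "last_name": "Lee", "postcode": "2600"},
    }
    print(sorted(candidate_pairs(db_a, db_b)))
```

In this toy data, `a1` has no postcode and `b2` has no first name, yet both are still paired with their counterparts through the one combination whose values are all present. The inverted index avoids comparing every record in one database with every record in the other, which is the kind of shortcut that large-scale linkage depends on.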


Updated: 2021-04-21
