Theoretical limits of microclustering for record linkage
Biometrika (IF 2.7), Pub Date: 2018-03-19, DOI: 10.1093/biomet/asy003
J E Johndrow, K Lum, D B Dunson

There has been substantial recent interest in record linkage, where one attempts to group the records pertaining to the same entities from one or more large databases that lack unique identifiers. This can be viewed as a type of microclustering, with few observations per cluster and a very large number of clusters. We show that the problem is fundamentally hard from a theoretical perspective and, even in idealized cases, accurate entity resolution is effectively impossible unless the number of entities is small relative to the number of records and/or the separation between records from different entities is extremely large. These results suggest conservatism in interpretation of the results of record linkage, support collection of additional data to more accurately disambiguate the entities, and motivate a focus on coarser inference. For example, results from a simulation study suggest that sometimes one may obtain accurate results for population size estimation even when fine-scale entity resolution is inaccurate.

Updated: 2018-03-19