d-blink: Distributed End-to-End Bayesian Entity Resolution
Journal of Computational and Graphical Statistics (IF 1.4), Pub Date: 2020-09-23, DOI: 10.1080/10618600.2020.1825451
Neil G. Marchant 1 , Andee Kaplan 2 , Daniel N. Elazar 3 , Benjamin I. P. Rubinstein 1 , Rebecca C. Steorts 4

Entity resolution (ER), also known as record linkage or de-duplication, is the process of merging noisy databases, often in the absence of a unique identifier. A major advancement in ER methodology has been the application of Bayesian generative models. Such models provide a natural framework for clustering records to unobserved (latent) entities, while providing exact uncertainty quantification and tight performance bounds. Despite these advancements, existing models do not scale to realistically sized databases (larger than 1000 records), and they do not incorporate probabilistic blocking. In this paper, we propose "distributed Bayesian linkage", or d-blink: the first scalable and distributed end-to-end Bayesian model for ER, which propagates uncertainty in blocking, matching and merging. We make several novel contributions, including: (i) incorporating probabilistic blocking directly into the model through auxiliary partitions; (ii) support for missing values; (iii) a partially-collapsed Gibbs sampler; and (iv) a novel perturbation sampling algorithm (leveraging the Vose-Alias method) that enables fast updates of the entity attributes. Finally, we conduct experiments on five data sets, which show that d-blink can achieve significant efficiency gains (in excess of 300×) when compared to existing non-distributed methods.
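
For readers unfamiliar with the Vose-Alias method named in contribution (iv), the following is a minimal Python sketch of the standard alias method for sampling from a discrete distribution: an O(n) table build followed by O(1) draws. It is not the paper's perturbation sampling algorithm, which the abstract does not spell out; the function names `build_alias_table` and `alias_draw` are hypothetical and chosen only for illustration.

```python
import random


def build_alias_table(weights):
    """Preprocess non-negative weights into (prob, alias) tables in O(n).

    Illustrative sketch of the standard Vose alias method; the names used
    here are assumptions and do not come from the d-blink paper.
    """
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("weights must be non-empty with a positive sum")

    # Rescale so that the average cell holds mass exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = [0] * n

    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]

    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # chance of keeping index s when cell s is hit
        alias[s] = l          # otherwise return the donor index l
        # The large cell donates mass to top the small cell up to 1.
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        (small if scaled[l] < 1.0 else large).append(l)

    # Any leftovers hold mass 1 up to floating-point rounding.
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i

    return prob, alias


def alias_draw(prob, alias, rng=random):
    """Draw one index in O(1): pick a cell uniformly, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

A typical use would be to call `build_alias_table` once per distribution and then call `alias_draw` repeatedly. The constant-time draw is what makes the method attractive when many samples are needed from the same distribution; how d-blink adapts it to updating entity attributes is described in the paper itself.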

Updated: 2020-09-23