Efficient and effective ER with progressive blocking
The VLDB Journal (IF 2.8), Pub Date: 2021-03-13, DOI: 10.1007/s00778-021-00656-7
Sainyam Galhotra, Donatella Firmani, Barna Saha, Divesh Srivastava

Blocking is a mechanism for improving the efficiency of entity resolution (ER) that aims to quickly prune all non-matching record pairs. However, depending on the distribution of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help ER scale but can adversely affect its effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness–efficiency trade-off is revealed only when the output of ER starts to become available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we reach the desired trade-off, using a limited amount of ER results as guidance in every round. We formally prove that pBlocking converges efficiently (\(O(n \log^2 n)\) time complexity, where \(n\) is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by 5\(\times\) and 60%, respectively, improving the overall F-score of the entire ER process by up to 60%.
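The feedback loop described in the abstract can be illustrated with a toy sketch. This is not the authors' pBlocking algorithm: the token-based blocks, the Jaccard matcher, the inverse-size bootstrap score, the match-rate re-scoring rule, and every name and parameter below (`progressive_blocking`, `rounds`, `budget`, `threshold`, `RECORDS`) are assumptions chosen only to show the general shape of "block, resolve a little, re-score the blocks, repeat".

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data, invented for this illustration.
RECORDS = {
    1: "apple iphone 12 black",
    2: "apple iphone 12 black 64gb",
    3: "samsung galaxy s21",
    4: "samsung galaxy s21 phone",
    5: "apple macbook pro",
}


def token_blocks(records):
    """Token blocking: one block per token, holding the ids of records that contain it."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for tok in set(text.lower().split()):
            blocks[tok].add(rid)
    # A block with a single record yields no candidate pairs, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def jaccard(a, b):
    """Token-set Jaccard similarity, used here as a stand-in pairwise matcher."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def progressive_blocking(records, rounds=3, budget=2, threshold=0.5):
    """Toy feedback loop: resolve a few blocks per round, then re-score the rest."""
    blocks = token_blocks(records)
    # Bootstrap score (assumed heuristic): smaller blocks are processed first.
    scores = {tok: 1.0 / len(ids) for tok, ids in blocks.items()}
    done, compared, matches = set(), set(), set()
    for _ in range(rounds):
        pending = [tok for tok in blocks if tok not in done]
        # Spend this round's budget on the highest-scoring unprocessed blocks.
        for tok in sorted(pending, key=lambda t: (-scores[t], t))[:budget]:
            done.add(tok)
            for pair in combinations(sorted(blocks[tok]), 2):
                if pair in compared:
                    continue  # already resolved through another block
                compared.add(pair)
                if jaccard(records[pair[0]], records[pair[1]]) >= threshold:
                    matches.add(pair)
        # Feedback: re-score unprocessed blocks by the match rate among the
        # pairs they contain that have already been resolved.
        for tok, ids in blocks.items():
            if tok in done:
                continue
            seen = [p for p in combinations(sorted(ids), 2) if p in compared]
            if seen:
                scores[tok] = sum(p in matches for p in seen) / len(seen)
    return matches, compared
```

Calling `progressive_blocking(RECORDS)` returns the set of matched id pairs together with the set of pairs that were actually compared, so the saving relative to comparing all pairs can be inspected directly. The re-scoring rule mixes two different scales (inverse block size and observed match rate), which is acceptable for a sketch but would need a more principled scoring function in practice.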



