A Survey of Blocking and Filtering Techniques for Entity Resolution
arXiv - CS - Databases, Pub Date: 2019-05-15, DOI: arxiv-1905.06167
George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, Themis Palpanas
Efficiency techniques have been an integral part of Entity Resolution since its
infancy. In this survey, we organize the bulk of works in the field into
Blocking, Filtering and hybrid techniques, facilitating their understanding and
use. We also provide in-depth coverage of each category, further classifying
the corresponding works into novel sub-categories. Lately, efficiency
techniques have received more attention due to the rise of Big Data, which
includes large volumes of semi-structured data. Such data challenge not only
the scalability of efficiency techniques, but also their core
assumptions: that Blocking requires schema knowledge and that Filtering
requires high similarity thresholds. The former led to the introduction of
schema-agnostic Blocking in conjunction with Block Processing techniques, while
the latter led to more relaxed similarity criteria. Our survey covers these
new fields in detail, placing all relevant works in context.
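
As a rough illustration of the schema-agnostic Blocking mentioned in the abstract, the sketch below shows token blocking, one commonly cited approach in which every token found in any attribute value becomes a blocking key, so no schema knowledge is needed. The function names and the sample records are hypothetical and are not taken from the survey; this is a minimal sketch, not the authors' implementation.

```python
import re
from collections import defaultdict
from itertools import combinations


def token_blocking(entities):
    """Group entity ids by every token found in any attribute value.

    Attribute names are ignored, so records with different schemas
    can still end up in the same block.
    """
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(entity_id)
    # A block holding a single entity produces no comparisons, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct pairs of entities that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records with heterogeneous attribute names.
entities = {
    1: {"name": "John Smith", "city": "Athens"},
    2: {"fullName": "Smith, John", "location": "Paris"},
    3: {"title": "Mary Jones", "addr": "Athens"},
}

blocks = token_blocking(entities)
pairs = candidate_pairs(blocks)
print(sorted(pairs))
```

Only pairs that share a token are kept as candidates for matching; here entities 2 and 3 share no token, so that comparison is skipped. Blocks built this way tend to overlap heavily, which is why the abstract pairs schema-agnostic Blocking with Block Processing techniques.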
Updated: 2020-08-24