A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching
arXiv - CS - Databases Pub Date : 2020-03-29 , DOI: arxiv-2003.13114
Venkata Vamsikrishna Meduri, Lucian Popa, Prithviraj Sen, Mohamed Sarwat

Entity Matching (EM) is a core data cleaning task that aims to identify different mentions of the same real-world entity. Active learning is one way to address the scarcity of labeled data in practice: it dynamically collects the examples that need to be labeled by an Oracle and refines the learned model (classifier) on them. In this paper, we build a unified active learning benchmark framework for EM that allows users to easily combine different learning algorithms with applicable example selection algorithms. The goal of the framework is to provide practitioners with concrete guidelines as to which active learning combinations work well for EM. To this end, we perform comprehensive experiments on publicly available EM datasets from the product and publication domains to evaluate active learning methods, using a variety of metrics including EM quality, number of labels, and example selection latency. Our most surprising finding is that active learning with fewer labels can learn a classifier of quality comparable to that of supervised learning. In fact, for several of the datasets, we show that there is an active learning combination that beats the state-of-the-art supervised learning result. Our framework also includes novel optimizations that improve the quality of the learned model by roughly 9% in terms of F1-score and reduce example selection latencies by up to 10x without affecting the quality of the model.
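
The loop the abstract describes (select an example, ask the Oracle for its label, refit the classifier) can be illustrated with a minimal sketch. This is not the paper's framework. The learner here is a one-dimensional similarity threshold and the selector is plain uncertainty sampling, both chosen only for brevity, and every name (`fit_threshold`, `active_learn`, `oracle`, the toy `scores` and `truth`) is hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def fit_threshold(labeled: Dict[int, bool], scores: Sequence[float]) -> float:
    """Stand-in learner: pick the cut that misclassifies the fewest labeled pairs.

    A candidate pair is predicted to be a match when its score >= threshold.
    """
    points = sorted(scores[i] for i in labeled)
    # Candidate cuts: the two extremes plus midpoints between adjacent labeled scores.
    candidates = [0.0, 1.0] + [(a + b) / 2 for a, b in zip(points, points[1:])]

    def errors(t: float) -> int:
        return sum((scores[i] >= t) != is_match for i, is_match in labeled.items())

    return min(candidates, key=errors)


def active_learn(
    scores: Sequence[float],
    oracle: Callable[[int], bool],
    seed: List[int],
    budget: int,
) -> Tuple[float, Dict[int, bool]]:
    """Pool-based active learning with uncertainty sampling (illustrative only)."""
    labeled = {i: oracle(i) for i in seed}  # initial labels from the Oracle
    threshold = fit_threshold(labeled, scores)
    for _ in range(budget):
        pool = [i for i in range(len(scores)) if i not in labeled]
        if not pool:
            break
        # Example selection: query the pair the current model is least sure about,
        # i.e. the one whose score lies closest to the decision threshold.
        query = min(pool, key=lambda i: abs(scores[i] - threshold))
        labeled[query] = oracle(query)
        threshold = fit_threshold(labeled, scores)  # refine the model
    return threshold, labeled


# Toy pool (assumed data): one similarity score per candidate record pair,
# with the ground truth that only the Oracle can see.
scores = [0.05, 0.12, 0.20, 0.33, 0.41, 0.62, 0.71, 0.80, 0.90, 0.96]
truth = [False, False, False, False, False, True, True, True, True, True]

threshold, labeled = active_learn(scores, lambda i: truth[i], seed=[0, 9], budget=3)
predictions = [s >= threshold for s in scores]
```

On this toy pool, two seed labels plus three queried labels are enough for the learned threshold to separate all ten pairs. A real EM setup would swap in a richer learner and selector, which is the kind of combination the benchmark is meant to compare.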

Updated: 2020-03-31