Record fusion: A learning approach, arXiv - CS - Databases

arXiv - CS - Databases, Pub Date: 2020-06-18, DOI: arxiv-2006.10208
Alireza Heidari, George Michalopoulos, Shrinu Kushagra, Ihab F. Ilyas, Theodoros Rekatsinas

Record fusion is the task of aggregating multiple records that correspond to the same real-world entity in a database. We can view record fusion as a machine learning problem where the goal is to predict the "correct" value of each attribute for each entity. Given a database, we use a combination of attribute-level, record-level, and database-level signals to construct a feature vector for each cell (that is, each (row, col) pair) of that database. We use this feature vector along with the ground-truth information to learn a classifier for each attribute of the database. Our learning algorithm uses a novel stagewise additive model. At each stage, we construct a new feature vector by combining a part of the original feature vector with features computed from the predictions of the previous stage. We then learn a softmax classifier over the new feature space. This greedy stagewise approach can be viewed as a deep model in which each stage adds more complicated non-linear transformations of the original feature vector. We show that our approach fuses records with an average precision of ~98% when source information of records is available, and ~94% without source information, across a diverse array of real-world datasets. We compare our approach to a comprehensive collection of data fusion and entity consolidation methods considered in the literature. We show that our approach can achieve an average precision improvement of ~20% with source information and ~45% without it.
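The stagewise idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the "part of the original feature vector" is simply the first `keep` features, that the features derived from the previous stage are its raw class probabilities, and that each stage is trained by plain gradient descent. The function names, the toy data, and the two toy features are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(W, x):
    """Class probabilities; each row of W is one class's weights plus a trailing bias."""
    return softmax([sum(w * v for w, v in zip(row, x)) + row[-1] for row in W])


def train_softmax(X, y, n_classes, lr=0.5, epochs=300):
    """Fit a softmax classifier with batch gradient descent on the cross-entropy loss."""
    d = len(X[0])
    W = [[0.0] * (d + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (d + 1) for _ in range(n_classes)]
        for x, label in zip(X, y):
            p = predict_proba(W, x)
            for k in range(n_classes):
                err = p[k] - (1.0 if k == label else 0.0)
                for j in range(d):
                    grad[k][j] += err * x[j]
                grad[k][d] += err  # bias gradient
        for k in range(n_classes):
            for j in range(d + 1):
                W[k][j] -= lr * grad[k][j] / len(X)
    return W


def fit_stagewise(X, y, n_classes, n_stages=3, keep=1):
    """Greedy stagewise fit: each stage is trained once and then frozen.

    Assumption: stage 0 sees the full original features; every later stage
    sees the first `keep` original features plus the previous stage's
    class probabilities.
    """
    base = [x[:keep] for x in X]
    feats, stages = X, []
    for _ in range(n_stages):
        W = train_softmax(feats, y, n_classes)
        stages.append(W)
        probs = [predict_proba(W, f) for f in feats]
        feats = [b + p for b, p in zip(base, probs)]
    return stages


def predict_stagewise(stages, x, keep=1):
    """Run one feature vector through all stages and return the final class index."""
    feats = x
    for W in stages:
        probs = predict_proba(W, feats)
        feats = x[:keep] + probs
    return probs.index(max(probs))


# Toy data: two hypothetical per-cell signals, two classes.
X = [[1.0, 0.0], [0.9, 0.2], [0.1, 0.8], [0.0, 1.0]]
y = [0, 0, 1, 1]
stages = fit_stagewise(X, y, n_classes=2)
preds = [predict_stagewise(stages, x) for x in X]
print(preds)
```

In this sketch, the later stages operate on outputs of earlier softmax layers, which is one way to read the abstract's remark that the stagewise model resembles a deep model built greedily, one stage at a time.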


Updated: 2020-06-19
