Efficient Entity Resolution on Heterogeneous Records
IEEE Transactions on Knowledge and Data Engineering (IF 8.9), Pub Date: 2020-05-01, DOI: 10.1109/tkde.2019.2898191
Yiming Lin, Hongzhi Wang, Jianzhong Li, Hong Gao

Entity resolution (ER) is the problem of identifying and merging records that refer to the same real-world entity. In many scenarios, raw records are stored in heterogeneous environments; specifically, the schemas of the records may differ from one another. To make better use of such records, most existing work assumes that schema matching and data exchange have already been performed to convert records under different schemas into records under a predefined schema. However, we observe that schema matching can lose information in some cases, and that information could be useful or even crucial to ER. To leverage sufficient information from heterogeneous sources, in this paper we address several challenges of ER on heterogeneous records and show that none of the existing similarity metrics or their transformations can be applied to find similar records under heterogeneous settings. Motivated by this, we design a similarity function and propose a novel framework to iteratively find records that refer to the same entity. For efficiency, we build an index to generate candidates and accelerate similarity computation. Evaluations on real-world datasets show the effectiveness and efficiency of our methods.
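The abstract does not give the paper's similarity function, index structure, or merge rule, so the following is only a minimal illustrative sketch of the general pattern it describes: generate candidate pairs from an index, score them with a schema-agnostic similarity, and merge iteratively. Everything here is an assumption made for illustration and not the authors' method: the token-level Jaccard similarity, the inverted index on value tokens, the 0.5 threshold, and the helper names `value_tokens`, `candidate_pairs`, and `resolve`.

```python
from collections import defaultdict
from itertools import combinations


def value_tokens(record):
    """Collect lowercase tokens from all attribute values.

    Attribute names are ignored, so records with different schemas
    can still be compared (an illustrative simplification).
    """
    tokens = set()
    for value in record.values():
        tokens.update(str(value).lower().split())
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed stand-in metric)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_pairs(token_sets):
    """Build an inverted index (token -> record ids) and return every
    pair of ids that shares at least one token."""
    index = defaultdict(set)
    for rid, tokens in token_sets.items():
        for tok in tokens:
            index[tok].add(rid)
    pairs = set()
    for rids in index.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs


def resolve(records, threshold=0.5):
    """Merge candidate clusters one pair at a time until no candidate
    pair reaches the (hypothetical) similarity threshold."""
    clusters = {i: {i} for i in range(len(records))}
    token_sets = {i: value_tokens(r) for i, r in enumerate(records)}
    merged = True
    while merged:
        merged = False
        for a, b in sorted(candidate_pairs(token_sets)):
            if jaccard(token_sets[a], token_sets[b]) >= threshold:
                # Fold cluster b into cluster a, pooling their tokens,
                # then rebuild the index on the next pass.
                clusters[a] |= clusters.pop(b)
                token_sets[a] |= token_sets.pop(b)
                merged = True
                break
    return sorted(sorted(c) for c in clusters.values())


records = [
    {"name": "Alice Smith", "city": "Boston"},
    {"full_name": "Alice Smith", "location": "Boston", "phone": "555-0100"},
    {"title": "Bob Jones", "town": "Denver"},
]
print(resolve(records))
```

In this toy input the first two records use different attribute names but share three of four value tokens, so they are merged without any schema matching, while the third record shares no tokens and never becomes a candidate. Rebuilding the index after every merge keeps the sketch short; it is not meant to be efficient.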

Updated: 2020-05-01