Automatic weighted matching rectifying rule discovery for data repairing
The VLDB Journal (IF 2.8) Pub Date: 2020-06-09, DOI: 10.1007/s00778-020-00617-6
Hiba Abu Ahmad, Hongzhi Wang

Data repairing is a key problem in data cleaning that aims to uncover and rectify data errors. Traditional methods rely on data dependencies to detect errors, but they cannot rectify them. To overcome this limitation, recent methods define repairing rules that both detect and fix errors. However, all existing data repairing rules are provided by experts, which is expensive in time and effort. Moreover, rule-based data repairing methods need an externally verified data source or user verification; without one, they are incomplete and can repair only a small number of errors. In this paper, we define weighted matching rectifying rules (WMRRs), based on similarity matching, to capture more errors. We propose a novel algorithm that discovers WMRRs automatically from the dirty data at hand. We also develop an automatic algorithm for resolving inconsistencies among rules. Additionally, based on WMRRs, we propose an automatic data repairing algorithm (WMRR-DR) that uncovers a large number of errors and rectifies them dependably. We experimentally evaluate our method on both real-life and synthetic data. The experimental results show that our method can discover effective WMRRs from the dirty data at hand and perform dependable, fully automatic repairing based on the discovered WMRRs, with higher accuracy than existing dependable methods.



Updated: 2020-06-09