Enriching Data Imputation under Similarity Rule Constraints
IEEE Transactions on Knowledge and Data Engineering ( IF 8.9 ) Pub Date : 2020-02-01 , DOI: 10.1109/tkde.2018.2883103
Shaoxu Song , Yu Sun , Aoqian Zhang , Lei Chen , Jianmin Wang

Incomplete information often arises in database applications such as data integration, data cleaning, and data exchange. Data imputation typically fills a missing value with the values of neighbors that share the same or similar information. Such neighbors can be identified either with certainty by editing rules or more broadly by similarity relationships. Owing to data sparsity, the number of neighbors identified by editing rules w.r.t. value equality is rather limited, especially when data values exhibit variance. To enrich the imputation candidates, a natural idea is to also consider neighbors identified by similarity relationships. However, the candidates suggested by these (heterogeneous) similarity neighbors may conflict with each other. In this paper, we propose to use similarity rules that tolerate small variations (instead of the aforesaid editing rules with strict equality constraints) to rule out the invalid candidates provided by similarity neighbors. To enrich the data imputation, i.e., to impute more of the missing values, we study the problem of maximizing the missing data imputation. Our major contributions include (1) an NP-hardness analysis of solving as well as approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate the superiority of our proposal in filling accuracy. We also demonstrate that a record matching application is indeed improved after applying the proposed imputation.
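
The filtering idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the rule form (rows within `eps_lhs` on one attribute must be within `eps_rhs` on another), the function names, the attribute names `x` and `y`, and all thresholds are assumptions chosen for illustration. Candidates are collected from a deliberately wide similarity neighborhood, and a tighter similarity rule then discards candidates that would conflict with existing rows.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def violates(a: Row, b: Row, lhs: str, rhs: str,
             eps_lhs: float, eps_rhs: float) -> bool:
    """Hypothetical similarity rule: rows that are within eps_lhs on `lhs`
    must also be within eps_rhs on `rhs`. Rows with missing values are
    treated as non-violating."""
    if None in (a[lhs], b[lhs], a[rhs], b[rhs]):
        return False
    return abs(a[lhs] - b[lhs]) <= eps_lhs and abs(a[rhs] - b[rhs]) > eps_rhs


def valid_candidates(rows: List[Row], target: int, lhs: str, rhs: str,
                     neighbor_radius: float, eps_lhs: float,
                     eps_rhs: float) -> List[float]:
    """Collect candidate values for rows[target][rhs] from similarity
    neighbors, then keep only those that violate no rule instance."""
    t = rows[target]
    others = [r for i, r in enumerate(rows) if i != target]

    # Step 1: candidates come from neighbors similar on `lhs`
    # (a wider radius than the rule itself, to enrich the candidate set).
    candidates = {
        r[rhs] for r in others
        if r[lhs] is not None and r[rhs] is not None
        and abs(r[lhs] - t[lhs]) <= neighbor_radius
    }

    # Step 2: drop candidates that would conflict with any other row.
    kept = []
    for value in sorted(candidates):
        filled = {**t, rhs: value}
        if not any(violates(filled, r, lhs, rhs, eps_lhs, eps_rhs)
                   for r in others):
            kept.append(value)
    return kept


rows = [
    {"x": 10.0, "y": 100.0},
    {"x": 10.5, "y": 102.0},
    {"x": 11.8, "y": 130.0},
    {"x": 10.2, "y": None},   # row with the missing value
]
result = valid_candidates(rows, target=3, lhs="x", rhs="y",
                          neighbor_radius=2.0, eps_lhs=1.0, eps_rhs=5.0)
print(result)
```

Here all three complete rows count as similarity neighbors and propose a value, but the candidate 130.0 is ruled out because the filled row would be close to the first row on `x` while differing from it on `y` by more than the tolerance. Choosing among the surviving candidates, and doing so jointly for many missing cells, is the optimization problem the paper studies and is not attempted in this sketch.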

Updated: 2020-02-01