Complexity and Efficient Algorithms for Data Inconsistency Evaluating and Repairing, arXiv - CS - Databases


arXiv - CS - Databases, Pub Date: 2020-01-02, DOI: arxiv-2001.00315
Dongjing Miao, Zhipeng Cai, Jianzhong Li, Xiangyu Gao, Xianmin Liu

Evaluating and repairing data inconsistency are major concerns in data quality management. As a basic computing task, optimal subset repair is not only applied to cost estimation during database repairing, but is also used directly to derive an evaluation of database inconsistency. Computing an optimal subset repair means finding a minimum set of tuples in an inconsistent database whose removal leaves a consistent subset. Tight bounds on the complexity and efficient algorithms are still unknown. In this paper, we improve the existing complexity and algorithmic results, and give a fast estimation of the size of an optimal subset repair. First, we strengthen the dichotomy for the optimal subset repair computation problem: we show that it is not only APX-complete, but also NP-hard to approximate an optimal subset repair within a factor better than $17/16$ in most cases. Second, we show a $(2-0.5^{\sigma-1})$-approximation whenever $\sigma$ functional dependencies are given, and a $(2-\eta_k+\frac{\eta_k}{k})$-approximation when an $\eta_k$-portion of tuples have the $k$-quasi-Turán property for some $k>1$. Finally, we show a sublinear estimator of the size of an optimal S-repair for subset queries; it outputs an estimation of ratio $2n+\epsilon n$ with high probability, thus deriving an estimation of the FD-inconsistency degree of ratio $2+\epsilon$. To support a variety of subset queries for FD-inconsistency evaluation, we unify them as the $\subseteq$-oracle, which can answer membership queries and return $p$ uniformly sampled tuples whenever given a number $p$. Experiments are conducted on range queries as an implementation of the $\subseteq$-oracle, and the results show the efficiency of our FD-inconsistency degree estimator.
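
To make the notion of a subset repair concrete, the following is a minimal sketch of the classical baseline that the abstract's approximation ratios improve on. It is not the paper's algorithm. It assumes tuples are Python dicts and each functional dependency is a hypothetical `(lhs, rhs)` pair of attribute-name tuples. Because a functional dependency is always violated by a pair of tuples, a subset repair is a vertex cover of the pairwise conflict graph, and deleting both endpoints of a greedily built maximal matching removes at most twice as many tuples as an optimal subset repair.

```python
from itertools import combinations

def violates(t1, t2, fds):
    """Return True if tuples t1 and t2 jointly violate some FD lhs -> rhs."""
    for lhs, rhs in fds:
        same_lhs = all(t1[a] == t2[a] for a in lhs)
        diff_rhs = any(t1[b] != t2[b] for b in rhs)
        if same_lhs and diff_rhs:
            return True
    return False

def approx_subset_repair(rows, fds):
    """Return indices of tuples to delete; at most 2x the optimal count.

    Greedy maximal matching on the conflict graph: whenever two tuples
    that are both still present conflict, delete both of them.
    This naive version checks all pairs, so it takes quadratic time.
    """
    deleted = set()
    for i, j in combinations(range(len(rows)), 2):
        if i in deleted or j in deleted:
            continue
        if violates(rows[i], rows[j], fds):
            deleted.update((i, j))
    return deleted

# Illustrative data (an assumption, not from the paper): FD zip -> city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "New York"},
    {"zip": "94105", "city": "San Francisco"},
]
fds = [(("zip",), ("city",))]

deleted = approx_subset_repair(rows, fds)
kept = [r for k, r in enumerate(rows) if k not in deleted]
print(sorted(deleted), len(kept))
```

On this data the sketch deletes two tuples, while an optimal subset repair deletes only the single tuple with city "NYC", which illustrates the factor-2 gap that the stated $(2-0.5^{\sigma-1})$ ratio narrows.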


Updated: 2020-01-14
