Discovering dependencies with reliable mutual information
Knowledge and Information Systems ( IF 2.7 ) Pub Date : 2020-07-24 , DOI: 10.1007/s10115-020-01494-9
Panagiotis Mandros , Mario Boley , Jilles Vreeken

We consider the task of discovering functional dependencies in data for target attributes of interest. To solve it, we have to answer two questions: How do we quantify the dependency in a model-agnostic and interpretable way as well as reliably against sample size and dimensionality biases? How can we efficiently discover the exact or \(\alpha \)-approximate top-k dependencies? We address the first question by adopting information-theoretic notions. Specifically, we consider the mutual information score, for which we propose a reliable estimator that enables robust optimization in high-dimensional data. To address the second question, we then systematically explore the algorithmic implications of using this measure for optimization. We show the problem is NP-hard and justify worst-case exponential-time as well as heuristic search methods. We propose two bounding functions for the estimator, which we use as pruning criteria in branch-and-bound search to efficiently mine dependencies with approximation guarantees. Empirical evaluation shows that the derived estimator has desirable statistical properties, the bounding functions lead to effective exact and greedy search algorithms, and when combined, qualitative experiments show the framework indeed discovers highly informative dependencies.
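The sample-size and dimensionality bias mentioned in the abstract can be illustrated with a small sketch. It is not the estimator proposed in the paper. It assumes a plain plug-in mutual information and, as a stand-in correction, subtracts a Monte Carlo estimate of the score under random permutations of the target. All function names and parameters here are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Plug-in (empirical) Shannon entropy in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Plug-in mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def corrected_mutual_information(xs, ys, n_permutations=200, seed=0):
    """Plug-in score minus its average over random permutations of ys.

    Assumption: a Monte Carlo permutation baseline is used only for
    illustration; the abstract does not specify how the paper's
    reliable estimator is computed.
    """
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    total = 0.0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        total += mutual_information(xs, shuffled)
    return observed - total / n_permutations


if __name__ == "__main__":
    rng = random.Random(1)
    target = [rng.randint(0, 1) for _ in range(20)]

    # A unique identifier "determines" the target in any sample, so the
    # plug-in score equals H(target) even though the link is spurious.
    row_id = list(range(20))
    print("plug-in, row id:  ", mutual_information(row_id, target))
    print("corrected, row id:", corrected_mutual_information(row_id, target))

    # A genuine dependency keeps most of its score after the correction.
    x = [0, 1] * 50
    print("corrected, y = x: ", corrected_mutual_information(x, x))
```

For a multi-attribute candidate, each element of `xs` can be a tuple of attribute values. The plug-in score never decreases as attributes are added, which is why an uncorrected search tends to favor large attribute sets.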




Updated: 2020-07-24