A framework for dependency estimation in heterogeneous data streams
Distributed and Parallel Databases (IF 1.5), Pub Date: 2020-06-06, DOI: 10.1007/s10619-020-07295-x
Edouard Fouché, Alan Mazankiewicz, Florian Kalinke, Klemens Böhm

Estimating dependencies from data is a fundamental task of Knowledge Discovery. Identifying the relevant variables leads to a better understanding of data and improves both the runtime and the outcomes of downstream Data Mining tasks. Dependency estimation from static numerical data has received much attention. However, real-world data often occurs as heterogeneous data streams: On the one hand, data is collected online and is virtually infinite. On the other hand, the various components of a stream may be of different types, e.g., numerical, ordinal or categorical. For this setting, we propose Monte Carlo Dependency Estimation (MCDE), a framework that quantifies multivariate dependency as the average statistical discrepancy between marginal and conditional distributions, via Monte Carlo simulations. MCDE handles heterogeneity by leveraging three statistical tests: the Mann–Whitney U, the Kolmogorov–Smirnov and the Chi-Squared test. We demonstrate that MCDE goes beyond the state of the art regarding dependency estimation by meeting a broad set of requirements. Finally, we show with a real-world use case that MCDE can discover useful patterns in heterogeneous data streams.
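
To make the idea of "average statistical discrepancy between marginal and conditional distributions" concrete, the following is a minimal sketch and not the authors' implementation. It assumes static, purely numerical data, uses only the Kolmogorov–Smirnov statistic, and builds each conditional sample by restricting every non-reference dimension to a random contiguous block of ranks. The names `ks_statistic`, `mcde_sketch`, `iterations` and `slice_fraction` are hypothetical and chosen for illustration only.

```python
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both CDFs past every value <= x, so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def mcde_sketch(rows, iterations=200, slice_fraction=0.5, seed=0):
    """Average KS discrepancy between marginal and conditional samples (illustrative only).

    rows: list of equal-length numeric tuples, one tuple per observation.
    slice_fraction: assumed target share of points kept by each random condition.
    """
    n, dims = len(rows), len(rows[0])
    if dims < 2:
        raise ValueError("need at least two dimensions")
    rng = random.Random(seed)
    # Row indices sorted by value, once per dimension.
    order = [sorted(range(n), key=lambda r, d=d: rows[r][d]) for d in range(dims)]
    # Per-dimension block width so that the intersection keeps roughly slice_fraction.
    width = max(1, int(round(n * slice_fraction ** (1.0 / (dims - 1)))))
    total, used = 0.0, 0
    for _ in range(iterations):
        ref = rng.randrange(dims)  # dimension whose distribution is compared
        selected = set(range(n))
        for d in range(dims):
            if d == ref:
                continue
            start = rng.randint(0, n - width)
            selected &= set(order[d][start:start + width])
        if not selected:
            continue  # empty condition, skip this iteration
        conditional = [rows[r][ref] for r in selected]
        marginal = [row[ref] for row in rows]
        total += ks_statistic(conditional, marginal)
        used += 1
    return total / used if used else 0.0


if __name__ == "__main__":
    gen = random.Random(42)
    xs = [gen.random() for _ in range(300)]
    dependent = [(x, x + 0.05 * gen.random()) for x in xs]
    independent = [(x, gen.random()) for x in xs]
    print("dependent  :", mcde_sketch(dependent))
    print("independent:", mcde_sketch(independent))
```

Under these assumptions, strongly coupled columns should yield a noticeably higher score than independent ones. Handling ordinal or categorical components, as the abstract describes, would mean swapping in the Mann–Whitney U or Chi-Squared test for the KS statistic, and a streaming setting would additionally need incremental index maintenance, which this sketch omits.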

Updated: 2020-06-06