Bayesian Nonparametric Unsupervised Concept Drift Detection for Data Stream Mining
ACM Transactions on Intelligent Systems and Technology (IF 5), Pub Date: 2020-11-24, DOI: 10.1145/3420034
Junyu Xuan, Jie Lu, Guangquan Zhang

Online data stream mining is of great significance in practice because data streams are ubiquitous in real-world scenarios, especially in the big data era. Traditional data mining algorithms cannot be applied directly to data streams because (1) the underlying data distribution may change over time (i.e., concept drift) and (2) labels for streaming data are, in practice, delayed, scarce, or missing altogether. A new research area, named unsupervised concept drift detection, has emerged to tackle this difficulty, based mainly on two-sample hypothesis tests such as the Kolmogorov–Smirnov test. However, it is surprising that none of the existing methods in this area exploits the Bayesian nonparametric hypothesis test, which offers clear interpretability, a straightforward way to encode prior knowledge, and no strict or unrealistic requirement to fix in advance the form of the underlying data distribution. In this article, we present a Bayesian nonparametric unsupervised concept drift detection method based on the Polya tree hypothesis test. The basic idea is to decompose the underlying data distribution into a multi-resolution representation, which transforms the hypothesis test on the whole distribution into a recursion of simple binomial tests. An incremental mechanism is also designed specifically to improve efficiency in the stream setting. The method effectively detects drifts, locates where a drift happens, and provides the posteriors of the hypotheses. Experiments on synthetic data verify the desired properties of the proposed method, and experiments on real-world data show that it performs better for data stream mining than its frequentist counterpart in the literature.
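
The multi-resolution idea in the abstract can be illustrated with a minimal Python sketch of a generic Polya tree two-sample test. This is not the authors' implementation, and it omits their incremental mechanism. The sketch rests on several assumptions that do not come from the abstract: data are mapped to [0, 1] with the standard normal CDF, each node splits its interval in half, the prior at tree level j is Beta(c·j², c·j²), and the tree is truncated at a fixed depth. All function names (`polya_tree_log_bf`, `posterior_h0`, `to_unit_interval`) are hypothetical.

```python
import math
import random


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def to_unit_interval(v):
    """Map a real value into [0, 1] with the standard normal CDF (assumed base measure)."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def polya_tree_log_bf(x, y, depth=6, c=1.0):
    """Return (log Bayes factor of H0 'same distribution' vs H1 'different', per-node terms).

    x and y hold values in [0, 1]. Every node splits its interval in half, and
    the left/right counts get a Beta(a, a) prior with a = c * level**2 (assumption).
    """
    total = 0.0
    per_node = []  # (lo, hi, term); strongly negative terms point at where the samples differ
    nodes = [(0.0, 1.0, list(x), list(y))]
    for level in range(1, depth + 1):
        a = c * level ** 2
        children = []
        for lo, hi, xs, ys in nodes:
            mid = (lo + hi) / 2.0
            x_left = [v for v in xs if v < mid]
            x_right = [v for v in xs if v >= mid]
            y_left = [v for v in ys if v < mid]
            y_right = [v for v in ys if v >= mid]
            n_l, n_r = len(x_left), len(x_right)
            m_l, m_r = len(y_left), len(y_right)
            # Beta-binomial evidence at this node: pooled counts (H0)
            # against separately modelled counts (H1).
            term = (log_beta(a + n_l + m_l, a + n_r + m_r) + log_beta(a, a)
                    - log_beta(a + n_l, a + n_r) - log_beta(a + m_l, a + m_r))
            total += term
            per_node.append((lo, hi, term))
            children.append((lo, mid, x_left, y_left))
            children.append((mid, hi, x_right, y_right))
        nodes = children
    return total, per_node


def posterior_h0(log_bf, prior_h0=0.5):
    """Posterior probability of H0 (no drift), computed as a numerically stable sigmoid."""
    log_odds = log_bf + math.log(prior_h0) - math.log(1.0 - prior_h0)
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


# Toy comparison: a reference window against a same-distribution window
# and against a mean-shifted window.
rng = random.Random(0)
reference = [to_unit_interval(rng.gauss(0.0, 1.0)) for _ in range(200)]
same = [to_unit_interval(rng.gauss(0.0, 1.0)) for _ in range(200)]
shifted = [to_unit_interval(rng.gauss(2.0, 1.0)) for _ in range(200)]

p_same = posterior_h0(polya_tree_log_bf(reference, same)[0])
p_drift = posterior_h0(polya_tree_log_bf(reference, shifted)[0])
print("P(no drift | same window)    =", p_same)
print("P(no drift | shifted window) =", p_drift)
```

In this sketch the per-node terms play the role of the "simple binomial tests": each one compares a pooled Beta-binomial model with two separate ones for a single left/right split, and summing them gives the test on the whole distribution. Inspecting which intervals carry the most negative terms gives a rough indication of where two windows differ.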

Updated: 2020-11-24