Statistical hierarchical clustering algorithm for outlier detection in evolving data streams


Machine Learning (IF 7.5) Pub Date: 2020-09-04, DOI: 10.1007/s10994-020-05905-4
Dalibor Krleža, Boris Vrdoljak, Mario Brčić

Anomaly detection is a hard data analysis process that requires constant creation and improvement of data analysis algorithms. Using traditional clustering algorithms to analyse data streams is impossible due to processing power and memory constraints. To solve this, the complexity of traditional clustering algorithms needed to be reduced, which led to the creation of sequential clustering algorithms. The usual approach is two-phase clustering, which uses an online phase to reduce data detail and complexity, and an offline phase to cluster the concepts created in the online phase. Anomaly detection in a data stream is usually handled in the online phase, as it requires unreduced data. In contrast, good macro-clustering is produced in the offline phase, which is why two-phase clustering algorithms have difficulty being equally good at anomaly detection and macro-clustering. In this paper, we propose a statistical hierarchical clustering algorithm equally suitable for both detecting anomalies and macro-clustering. The proposed algorithm is single-phased and uses statistical inference on the input data stream, resulting in statistical distributions that are constantly updated. This makes the classification adaptable, allowing outliers to be agglomerated into clusters and population evolution to be tracked, and allowing the algorithm to be used without knowing the expected number of clusters and outliers. The proposed algorithm was tested against typical clustering algorithms, including two-phase algorithms suitable for data stream analysis. A number of typical test cases were selected to show the universality and qualities of the proposed clustering algorithm.
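
The abstract describes statistical distributions that are updated as the stream arrives and are used to separate outliers from populations. The sketch below illustrates only that general idea, and it is not the authors' algorithm: it keeps a single one-dimensional Gaussian summary, updated with Welford's method, and flags values by z-score. The names `RunningGaussian` and `scan`, the threshold of 3.0, and the warm-up of 5 points are all assumptions made for illustration.

```python
import math


class RunningGaussian:
    """Incrementally updated 1-D Gaussian summary (Welford's method).

    Hypothetical helper for illustration; not taken from the paper.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        # Single-pass update: no past data points are stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; 0.0 until two points have been seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_outlier(self, x, threshold=3.0, warmup=5):
        # Assumed rule: make no decision until `warmup` points are absorbed.
        if self.n < warmup:
            return False
        s = self.std()
        if s == 0.0:
            return x != self.mean
        return abs(x - self.mean) / s > threshold


def scan(stream, threshold=3.0):
    """Return the values flagged as outliers, in arrival order."""
    model = RunningGaussian()
    flagged = []
    for x in stream:
        if model.is_outlier(x, threshold):
            flagged.append(x)  # flagged values do not update the model
        else:
            model.update(x)
    return flagged


if __name__ == "__main__":
    print(scan([10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 25.0, 10.3]))  # → [25.0]
```

A single summary like this cannot agglomerate outliers into new clusters or track several populations at once, both of which the abstract claims for the proposed algorithm; it only shows how a constantly updated distribution can drive an outlier decision without storing the stream.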


Updated: 2020-09-04
