Anytime clustering of data streams while handling noise and concept drift
Journal of Experimental & Theoretical Artificial Intelligence (IF 1.7), Pub Date: 2021-03-15, DOI: 10.1080/0952813x.2021.1882001
Jagat Sesh Challa, Poonam Goyal, Ajinkya Kokandakar, Dhananjay Mantri, Pranet Verma, Sundar Balasubramaniam, Navneet Goyal

ABSTRACT

Clustering of data streams has become very popular in recent times, owing to the rapid rise of real-time streaming utilities that produce large amounts of data at varying inter-arrival rates. We propose AnyClus, a framework for anytime clustering of data streams. AnyClus uses a proposed variant of the R-tree, AnyRTree, to capture incoming stream objects arriving at a variable rate and to index them as micro-clusters in a hierarchical fashion. The leaf-level micro-clusters produced are aggregated and stored in a logarithmic tilted-time window framework (TTWF). Our extensive experimental analysis shows (i) the capability of AnyClus to handle variable stream speeds (up to 250k objects/second); (ii) its ability to produce micro-clusters of high purity (≈1) and compactness; and (iii) the effectiveness of AnyRTree in handling noise, capturing concept drift, and preserving spatial locality in the indexing of micro-clusters, when compared to existing methods. We also propose a parallel framework, Any-MP-Clus, for anytime clustering of multiport data streams over commodity clusters. Any-MP-Clus uses AnyRTree at each computing node of the cluster (for each stream-port) and maintains the aggregated micro-clusters in TTWF. Experimental results on billion-scale datasets show that Any-MP-Clus is scalable and efficient, and produces clustering of higher quality.
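
The abstract does not spell out how a micro-cluster is represented. As a rough illustration only, the sketch below assumes the additive cluster-feature summary (count, per-dimension linear sum, per-dimension squared sum) that is common in stream clustering. The class name `MicroCluster` and its methods are hypothetical and are not taken from AnyClus. Additivity is what makes it cheap to aggregate leaf-level summaries before storing them in a time-window structure.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Sequence


@dataclass
class MicroCluster:
    """Hypothetical additive summary of a group of stream objects (assumed, not AnyClus's own)."""
    dim: int
    n: int = 0                                       # number of absorbed objects
    ls: List[float] = field(default_factory=list)    # per-dimension linear sum
    ss: List[float] = field(default_factory=list)    # per-dimension squared sum

    def __post_init__(self) -> None:
        if not self.ls:
            self.ls = [0.0] * self.dim
        if not self.ss:
            self.ss = [0.0] * self.dim

    def insert(self, point: Sequence[float]) -> None:
        """Absorb one object by updating the three sums in O(dim)."""
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other: "MicroCluster") -> "MicroCluster":
        """Aggregate two summaries; the sums simply add component-wise."""
        merged = MicroCluster(self.dim)
        merged.n = self.n + other.n
        merged.ls = [a + b for a, b in zip(self.ls, other.ls)]
        merged.ss = [a + b for a, b in zip(self.ss, other.ss)]
        return merged

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def radius(self) -> float:
        """Root-mean-square distance of the absorbed objects from the centroid."""
        var = sum(q / self.n - (s / self.n) ** 2 for s, q in zip(self.ls, self.ss))
        return sqrt(max(var, 0.0))  # clamp tiny negative values from rounding


a = MicroCluster(dim=2)
a.insert((0.0, 0.0))
a.insert((2.0, 0.0))
b = MicroCluster(dim=2)
b.insert((4.0, 0.0))
m = a.merge(b)
print(m.n, m.centroid(), m.radius())
```

Because merging needs only the three sums, the raw objects never have to be retained, which is the property a streaming index relies on when summaries are rolled up over time.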




Updated: 2021-03-15