S-RASTER: contraction clustering for evolving data streams
Journal of Big Data ( IF 8.1 ) Pub Date : 2020-08-13 , DOI: 10.1186/s40537-020-00336-3
Gregor Ulm , Simon Smith , Adrian Nilsson , Emil Gustavsson , Mats Jirstrand

Contraction Clustering (RASTER) is a single-pass algorithm for density-based clustering of 2D data. It can process arbitrary amounts of data in linear time and in constant memory, quickly identifying approximate clusters. It also exhibits good scalability in the presence of multiple CPU cores. RASTER exhibits very competitive performance compared to standard clustering algorithms, but at the cost of decreased precision. Yet, RASTER is limited to batch processing and unable to identify clusters that only exist temporarily. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that is able to identify clusters in evolving data streams. This algorithm retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step within a sliding window. The sliding window is efficiently pruned, and clustering is still performed in linear time. Like RASTER, S-RASTER trades off an often negligible amount of precision for speed. Our evaluation shows that competing algorithms are at least 50% slower. Furthermore, S-RASTER shows good qualitative results, based on standard metrics. It is very well suited to real-world scenarios where clustering does not happen continually but only periodically.
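The abstract does not spell out the mechanics, so the following is only a minimal sketch of how a grid-based contraction step and a pruned sliding window might look. It assumes that points are projected onto integer grid tiles, that tiles holding at least a threshold number of points count as significant, and that adjacent significant tiles are merged into clusters. The names `project`, `significant_tiles`, `cluster_tiles`, `SlidingWindowTiles` and the parameters `precision`, `threshold`, `min_size`, `window` are illustrative and are not taken from the paper.

```python
import math
from collections import Counter, deque


def project(point, precision=0):
    """Map a 2D point to an integer grid tile (assumed projection scheme)."""
    x, y = point
    scale = 10 ** precision
    return (math.floor(x * scale), math.floor(y * scale))


def significant_tiles(points, precision, threshold):
    """Single pass over the points: count per tile, keep the dense tiles."""
    counts = Counter(project(p, precision) for p in points)
    return {tile for tile, n in counts.items() if n >= threshold}


def cluster_tiles(tiles, min_size):
    """Group 8-connected significant tiles; drop groups below min_size."""
    remaining = set(tiles)
    clusters = []
    while remaining:
        start = remaining.pop()
        cluster = {start}
        frontier = deque([start])
        while frontier:
            x, y = frontier.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbour = (x + dx, y + dy)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        cluster.add(neighbour)
                        frontier.append(neighbour)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


class SlidingWindowTiles:
    """Per-tile counts over the last `window` time steps (hypothetical helper).

    Assumes time steps arrive in non-decreasing order.
    """

    def __init__(self, precision, threshold, window):
        self.precision = precision
        self.threshold = threshold
        self.window = window
        self.steps = deque()     # (time_step, Counter of tiles in that step)
        self.totals = Counter()  # tile counts summed over the current window

    def add(self, time_step, points):
        batch = Counter(project(p, self.precision) for p in points)
        self.steps.append((time_step, batch))
        self.totals.update(batch)
        # Prune batches that have fallen out of the window.
        while self.steps and self.steps[0][0] <= time_step - self.window:
            _, expired = self.steps.popleft()
            self.totals.subtract(expired)
        self.totals = +self.totals  # discard tiles whose count dropped to zero

    def significant(self):
        return {t for t, n in self.totals.items() if n >= self.threshold}


points = [(0.2, 0.3), (0.5, 0.5), (0.9, 0.1),
          (1.5, 0.5), (1.2, 0.2), (1.7, 0.9),
          (5.5, 5.5)]
tiles = significant_tiles(points, precision=0, threshold=3)
clusters = cluster_tiles(tiles, min_size=2)
```

In this sketch the point-to-tile pass touches each point once, and memory is bounded by the number of distinct tiles rather than the number of points, which is one way to read the linear-time and constant-memory claims above. Clustering is a separate call, matching the scenario where it runs periodically rather than on every update.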

Updated: 2020-08-13