evoStream – Evolutionary Stream Clustering Utilizing Idle Times, Big Data Research

evoStream – Evolutionary Stream Clustering Utilizing Idle Times
Big Data Research (IF 3.5), Pub Date: 2018-05-30, DOI: 10.1016/j.bdr.2018.05.005
Matthias Carnein, Heike Trautmann

Clustering is an important field in data mining that aims to reveal hidden patterns in data sets. It is widely used in marketing and medical applications to identify groups of similar objects. Clustering possibly unbounded and evolving data streams is of particular interest due to the widespread deployment of large and fast data sources such as sensors. The vast majority of stream clustering algorithms employ a two-phase approach in which the stream is first summarized in an online phase. Upon request, an offline phase reclusters the aggregations into the final clusters. In this setup, the online component idles and waits for the next observation whenever the stream is slow. This paper proposes a new stream clustering algorithm called evoStream, which performs evolutionary optimization in the idle times of the online phase to incrementally build and refine the final clusters. Since the online phase would otherwise idle, our approach does not reduce the processing speed while effectively removing the computational overhead of the offline phase. In extensive experiments on real data streams, we show that the proposed algorithm can output high-quality clusters at any time within the stream without the need for additional computational resources.
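
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the summary structure, the fitness function, or the evolutionary operators, so everything below is an assumption made for illustration. The sketch assumes centroid-style micro-clusters with a fixed radius, a weighted squared-distance cost, uniform crossover with Gaussian mutation, and a stream in which `None` stands for an idle tick. All names (`MicroCluster`, `online_update`, `evolve_once`, `run`, and so on) are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class MicroCluster:
    """Assumed summary structure: a running mean with a weight."""

    def __init__(self, point):
        self.center = list(point)
        self.weight = 1.0

    def absorb(self, point):
        self.weight += 1.0
        step = 1.0 / self.weight
        self.center = [c + step * (p - c) for c, p in zip(self.center, point)]


def online_update(micro_clusters, point, radius):
    """Online phase: merge into the nearest micro-cluster or open a new one."""
    if micro_clusters:
        nearest = min(micro_clusters, key=lambda m: sq_dist(m.center, point))
        if sq_dist(nearest.center, point) <= radius ** 2:
            nearest.absorb(point)
            return
    micro_clusters.append(MicroCluster(point))


def cost(solution, micro_clusters):
    """Weighted squared distance of each micro-cluster to its closest center."""
    return sum(m.weight * min(sq_dist(m.center, c) for c in solution)
               for m in micro_clusters)


def init_population(micro_clusters, k, size, rng):
    """Each candidate solution is k centers drawn from the micro-clusters."""
    centers = [m.center for m in micro_clusters]
    return [[list(c) for c in rng.sample(centers, k)] for _ in range(size)]


def evolve_once(population, micro_clusters, rng, mutation_scale=0.1):
    """One generation: crossover, mutation, replace the worst if improved."""
    p1, p2 = rng.sample(population, 2)
    child = [list(rng.choice((c1, c2))) for c1, c2 in zip(p1, p2)]
    child = [[x + rng.gauss(0.0, mutation_scale) for x in c] for c in child]
    worst = max(range(len(population)),
                key=lambda i: cost(population[i], micro_clusters))
    if cost(child, micro_clusters) < cost(population[worst], micro_clusters):
        population[worst] = child


def best_solution(population, micro_clusters):
    """Current final clusters: the candidate with the lowest cost."""
    return min(population, key=lambda s: cost(s, micro_clusters))


def run(stream, k, radius, pop_size=6, seed=0):
    """Consume a stream in which None marks an idle tick (assumption)."""
    rng = random.Random(seed)
    micro_clusters, population = [], None
    for item in stream:
        if item is not None:
            online_update(micro_clusters, item, radius)  # summarize the point
        elif population is None:
            if len(micro_clusters) >= k:
                population = init_population(micro_clusters, k, pop_size, rng)
        else:
            evolve_once(population, micro_clusters, rng)  # use the idle time
    return micro_clusters, population
```

Under these assumptions, `best_solution(population, micro_clusters)` can be called at any point in the stream, which mirrors the claim that clusters are available without a separate offline reclustering step.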

Updated: 2018-05-30
