Quality-driven early stopping for explorative cluster analysis for big data, SICS Software-Intensive Cyber-Physical Systems



Quality-driven early stopping for explorative cluster analysis for big data
SICS Software-Intensive Cyber-Physical Systems, Pub Date: 2019-02-06, DOI: 10.1007/s00450-019-00401-0
Manuel Fritz, Michael Behringer, Holger Schwarz

Data analysis has become a critical success factor for companies in all areas. Hence, it is necessary to gain knowledge quickly from available datasets, which is especially challenging in times of big data. Typical data mining tasks such as cluster analysis are very time-consuming, even when they run in highly parallel environments such as Spark clusters. To support data scientists in explorative data analysis processes, we need techniques that make data mining tasks more efficient. To this end, we introduce a novel approach that stops clustering algorithms as early as possible while still achieving an adequate quality of the detected clusters. Our approach exploits the iterative nature of many clustering algorithms and uses a metric to decide after which iteration the mining task should stop. We present experimental results based on a Spark cluster using multiple huge datasets. The experiments show that our approach can accelerate clustering by up to a factor of more than 800 by skipping many iterations that provide only little gain in quality. In this way, we find a good balance between the time required for data analysis and the quality of the analysis results.
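The idea of stopping an iterative clustering algorithm once further iterations add little quality can be illustrated with a small sketch. The abstract names neither the algorithm nor the metric, so everything concrete here is an assumption made for illustration: Lloyd's k-means as the clustering algorithm, the sum of squared errors (SSE) as the quality measure, and a hypothetical threshold `min_gain` on the relative SSE improvement per iteration. It is not the authors' method and it runs on a single machine rather than on Spark.

```python
import random


def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centroids[label]))
        for point, label in zip(points, labels)
    )


def assign(points, centroids):
    """Assign each point to the index of its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))
        for point in points
    ]


def update(points, labels, centroids):
    """Recompute each centroid as the mean of its points (keep it if empty)."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [pt for pt, lab in zip(points, labels) if lab == j]
        if members:
            new_centroids.append(
                tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def kmeans_early_stop(points, k, min_gain=0.01, max_iter=100, seed=0):
    """Lloyd's k-means that stops once the relative SSE gain is below min_gain.

    min_gain is a hypothetical threshold chosen for illustration only.
    Returns (centroids, labels, iterations_run).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids from the data
    labels = assign(points, centroids)
    prev_quality = sse(points, centroids, labels)
    for iteration in range(1, max_iter + 1):
        centroids = update(points, labels, centroids)
        labels = assign(points, centroids)
        quality = sse(points, centroids, labels)
        # Relative improvement of this iteration over the previous one.
        gain = (prev_quality - quality) / prev_quality if prev_quality > 0 else 0.0
        if gain < min_gain:
            return centroids, labels, iteration  # too little gain: stop early
        prev_quality = quality
    return centroids, labels, max_iter


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels, iterations = kmeans_early_stop(points, k=2)
print(iterations, labels)
```

Raising `min_gain` trades cluster quality for fewer iterations, which is the balance between analysis time and result quality that the abstract describes. A suitable threshold would have to be determined empirically for a given dataset.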


Updated: 2019-02-06
