Parallel and scalable Dunn Index for the validation of big data clusters, Parallel Computing



Parallel and scalable Dunn Index for the validation of big data clusters
Parallel Computing (IF 2.0), Pub Date: 2021-01-26, DOI: 10.1016/j.parco.2021.102751
Chiheb-Eddine Ben Ncir, Abdallah Hamza, Waad Bouaguel

Parallelizing data clustering algorithms has attracted the interest of many researchers over the past few years. Many efficient parallel algorithms have been proposed to partition huge volumes of data. The effectiveness of these algorithms is attributed to the distribution of data among a cluster of nodes and to the parallel computation models. Despite the effectiveness of parallel models in dealing with increasing volumes of data, little work has been done on the validation of big clusters. To address this issue, we propose a parallel and scalable model, referred to as S-DI (Scalable Dunn Index), to compute the Dunn Index measure for internal validation of clustering results. Rather than computing the Dunn Index on a single machine during the clustering validation process, the proposed measure is computed by distributing the partitioning among a cluster of nodes using a customized parallel model under the Apache Spark framework. The proposed S-DI is also enhanced by a Sketch and Validate sampling technique, which aims to approximate the Dunn Index value using a small representative data sample. Experiments on simulated and real datasets showed good scalability of the proposed measure and reliable validation compared to other existing measures when handling large-scale data.
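
To make the quantity being parallelized concrete, the following is a minimal single-machine sketch of the classical Dunn Index (smallest distance between points of different clusters divided by the largest cluster diameter), together with a naive sample-based approximation. It is not the S-DI model from the paper: the abstract does not specify which Dunn Index variant, Spark operations, or sampling scheme the authors use, so the Euclidean distance, the function names `dunn_index` and `sampled_dunn_index`, and the uniform per-cluster sampling are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def dunn_index(clusters):
    """Classical Dunn Index of a partition (assumed Euclidean distance).

    clusters: list of clusters, each a list of points (tuples of floats).
    Requires at least two clusters.
    """
    # Diameter: largest distance between two points of the same cluster.
    max_diameter = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    # Separation: smallest distance between points of different clusters.
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    if max_diameter == 0.0:
        return math.inf  # every cluster is a single point (or duplicates)
    return min_separation / max_diameter


def sampled_dunn_index(clusters, fraction, seed=0):
    """Approximate the Dunn Index on a uniform sample of each cluster.

    Hypothetical stand-in for a sampling step; NOT the paper's
    Sketch and Validate technique.
    """
    rng = random.Random(seed)
    sample = []
    for c in clusters:
        # Keep at least two points per cluster when the cluster allows it.
        k = min(len(c), max(2, round(fraction * len(c))))
        sample.append(rng.sample(c, k))
    return dunn_index(sample)


if __name__ == "__main__":
    clusters = [
        [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
        [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)],
    ]
    print("exact:  ", dunn_index(clusters))
    print("sampled:", sampled_dunn_index(clusters, fraction=0.67))
```

The exact computation compares every pair of points, so its cost grows quadratically with the number of points, which is what motivates distributing or sampling the work. Note that dropping points can only shrink diameters and widen separations, so a subsample like this one never underestimates the exact value; how close the estimate is depends on how representative the sample is.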


Updated: 2021-01-28
