Research on Parallel Adaptive Canopy-K-Means Clustering Algorithm for Big Data Mining Based on Cloud Platform, Journal of Grid Computing

Research on Parallel Adaptive Canopy-K-Means Clustering Algorithm for Big Data Mining Based on Cloud Platform
Journal of Grid Computing (IF 3.6) Pub Date: 2020-01-02, DOI: 10.1007/s10723-019-09504-z
Dongliang Xia, Feifei Ning, Weina He

This paper first surveys the types of clustering algorithms and describes the classical K-means and Canopy algorithms in detail. Then, combining the MapReduce computing model with the Spark cloud computing framework, it presents the parallel Canopy-K-means algorithm, in which the Canopy algorithm is used to optimize the initial values of K-means. However, the Canopy algorithm introduces a new distance threshold parameter, T2, that must be set from human experience, and choosing it manually is difficult for large data sets. The paper therefore proposes a parallel adaptive Canopy-K-means algorithm that runs in a cloud computing framework and determines T2 adaptively using a statistical method. Exploiting the parallelism of the MapReduce model, the parallel Canopy-K-means algorithm is optimized through adaptive parameter estimation, which removes the dependence on manually selected parameters in the Canopy step. After presenting the relevant theory and derivation of the algorithm, the authors build a cloud computing experiment platform on the Spark framework and run comparison experiments on the Stanford Large Network Dataset Collection (SNAP) dataset and a self-built Dimension Networks dataset. The experimental results show that the proposed method is effective.
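
To make the Canopy-then-K-means idea in the abstract concrete, here is a minimal single-machine sketch in Python. It is an illustration under stated assumptions, not the authors' method. The abstract does not describe the statistical rule used to set T2, so the hypothetical `estimate_t2` below substitutes the mean pairwise distance as a stand-in heuristic. The sketch also omits the looser T1 threshold of the full Canopy algorithm, picks the first remaining point as each canopy center instead of a random one, and does not attempt the MapReduce/Spark parallelization.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_t2(points):
    """Hypothetical stand-in for adaptive T2: the mean pairwise distance.

    Assumption: this is NOT the paper's statistical rule, which the
    abstract does not specify.
    """
    pairs = list(combinations(points, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


def canopy_centers(points, t2):
    """Simplified Canopy pass: repeatedly take the first remaining point
    as a center and drop every point within distance t2 of it."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        remaining = [p for p in remaining if euclidean(p, center) > t2]
    return centers


def kmeans(points, centers, iterations=20):
    """Plain Lloyd iteration starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster; keep the old
        # center if a cluster ends up empty.
        new_centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Two small, well-separated groups of 2-D points (made-up toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

t2 = estimate_t2(points)
seeds = canopy_centers(points, t2)   # number of seeds also fixes k
final = kmeans(points, seeds)
```

On this toy data the Canopy pass yields one seed per group, so K-means starts with both the number of clusters and its initial centers already chosen. That is the role the abstract assigns to Canopy. How well the mean-distance heuristic behaves on real data is not something this sketch establishes.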


Updated: 2020-01-02
