Improved K-Means Clustering Algorithm for Big Data Mining under Hadoop Parallel Framework, Journal of Grid Computing

Improved K-Means Clustering Algorithm for Big Data Mining under Hadoop Parallel Framework
Journal of Grid Computing (IF 3.6), Pub Date: 2019-12-20, DOI: 10.1007/s10723-019-09503-0
Weijia Lu

This paper studies clustering algorithms for mining big data, with the aim of improving both accuracy and efficiency. The traditional clustering algorithm is first modified to improve accuracy, and the modified algorithm is then parallelized to improve efficiency. To improve clustering accuracy, a density-based incremental K-means clustering algorithm is proposed on the basis of the K-means algorithm. First, the density of each data point is calculated, and each basic cluster is formed from a center point whose density is at least a given threshold, together with the points within its density range. Next, basic clusters are merged according to the distance between their cluster centers. Finally, any point not yet assigned to a cluster is assigned to the cluster nearest to it. To improve efficiency and reduce running time, a distributed database is used to simulate a shared memory space, and the algorithm is parallelized on the Hadoop cloud computing platform. Simulation results show that the clustering accuracy of the proposed algorithm exceeds that of the two algorithms it is compared against by more than 10%.
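The four steps in the abstract (density calculation, basic cluster formation, merging by center distance, assignment of leftover points) can be illustrated with a small serial sketch. This is not the paper's implementation: the abstract does not define density or the merge rule precisely, so the sketch assumes that density is the number of other points within a fixed radius, that cluster centers are centroids, and that two basic clusters merge when their centroids are within a fixed distance. The names `density_kmeans_sketch`, `radius`, `min_density`, and `merge_dist` are hypothetical, and the Hadoop parallelization is not shown.

```python
import math


def centroid(members, points):
    """Mean position of the points whose indices are in `members`."""
    dims = len(points[0])
    return tuple(
        sum(points[i][d] for i in members) / len(members) for d in range(dims)
    )


def density_kmeans_sketch(points, radius, min_density, merge_dist):
    """Serial sketch of density-seeded clustering (all parameters are assumptions)."""
    n = len(points)

    # Step 1: density of a point = number of other points within `radius`.
    neighbors = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= radius]
        for i in range(n)
    ]
    density = [len(nb) for nb in neighbors]

    # Step 2: visit points from densest to sparsest; an unassigned point with
    # density >= min_density becomes a center and claims its unassigned neighbors.
    assigned = [False] * n
    clusters = []
    for i in sorted(range(n), key=lambda idx: -density[idx]):
        if assigned[i] or density[i] < min_density:
            continue
        members = [i] + [j for j in neighbors[i] if not assigned[j]]
        for j in members:
            assigned[j] = True
        clusters.append(members)

    # Step 3: repeatedly merge any two basic clusters whose centroids are
    # at most `merge_dist` apart, recomputing centroids after each merge.
    merged = True
    while merged:
        merged = False
        cents = [centroid(c, points) for c in clusters]
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if math.dist(cents[a], cents[b]) <= merge_dist:
                    clusters[a].extend(clusters[b])
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break

    # Step 4: label clustered points, then give each leftover point the label
    # of the nearest centroid (leftovers stay -1 if no cluster was formed).
    labels = [-1] * n
    for k, members in enumerate(clusters):
        for j in members:
            labels[j] = k
    cents = [centroid(c, points) for c in clusters]
    for i in range(n):
        if labels[i] == -1 and cents:
            labels[i] = min(
                range(len(cents)),
                key=lambda k: math.dist(points[i], cents[k]),
            )
    return labels


# Two tight groups of four points plus one sparse point at (3, 3).
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (3, 3)]
labels = density_kmeans_sketch(sample, radius=1.5, min_density=3, merge_dist=3.0)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, 0]
```

In this toy input the point at (3, 3) has no neighbors within the radius, so it seeds no cluster and is attached to the nearer group in the last step. Raising `merge_dist` above the distance between the two centroids collapses both groups into a single cluster.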

Updated: 2019-12-20
