Abstract
This paper first reviews the main types of clustering algorithms and describes the classical K-means and Canopy algorithms in detail. Building on the MapReduce computing model and the Spark cloud computing framework, it then presents the parallel Canopy-K-means algorithm, in which Canopy is used to optimize the initial values of K-means. Canopy, however, introduces a new distance threshold parameter T2 that must be set from human experience, which is difficult to do manually for large data sets. This paper therefore proposes a parallel adaptive Canopy-K-means algorithm that runs in a cloud computing framework and determines T2 adaptively using a statistical method. Exploiting the parallelism of the MapReduce model, the parallel Canopy-K-means algorithm is optimized through adaptive parameter estimation, which removes the dependence of the Canopy step on manually selected parameters. After the relevant theory and derivation are presented, a cloud computing experimental platform is built on the Spark framework, and comparative experiments are performed on data from the Stanford Large Network Dataset Collection (SNAP) and a self-built Dimension Networks dataset. The experimental results show that the proposed method is effective.
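As a rough illustration of the idea summarized above, the following single-machine Python sketch uses Canopy to choose the initial K-means centers, with T2 estimated from the data instead of set by hand. The abstract does not state the paper's statistical estimator, so `estimate_t2` (half the mean pairwise distance) is purely an assumed stand-in, and the function names `euclidean`, `estimate_t2`, `canopy_centers`, and `kmeans` are hypothetical. The sketch is not parallel and does not reproduce the MapReduce or Spark implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_t2(points):
    """ASSUMPTION: half the mean pairwise distance stands in for the
    paper's statistical estimate of T2, which the abstract does not give."""
    n = len(points)
    if n < 2:
        return 0.0
    total = sum(
        euclidean(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return 0.5 * total / (n * (n - 1) / 2)


def canopy_centers(points, t2):
    """Take the first remaining point as a canopy center, then drop every
    point within t2 of it from the candidate list; repeat until empty."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        remaining = [p for p in remaining[1:] if euclidean(p, center) > t2]
    return centers


def kmeans(points, seeds, max_iter=100):
    """Plain Lloyd iterations started from the given seed centers."""
    centers = [tuple(s) for s in seeds]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centers:  # assignments are stable, so stop
            break
        centers = updated
    return centers


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
t2 = estimate_t2(points)
seeds = canopy_centers(points, t2)   # number of seeds also fixes k
final_centers = kmeans(points, seeds)
```

In this sketch the number of canopy centers also determines k, so neither k nor T2 is supplied by the user. The pairwise-distance estimate costs O(n²), so on large data it would have to be computed on a sample or replaced by a distributed statistic.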
Acknowledgments
This work was supported by the Applied Research Plan of Key Scientific Research Projects in Henan Colleges and Universities (No. 18B520028) and the Technology Plan Project of Henan Science (No. 182102210471).
Cite this article
Xia, D., Ning, F. & He, W. Research on Parallel Adaptive Canopy-K-Means Clustering Algorithm for Big Data Mining Based on Cloud Platform. J Grid Computing 18, 263–273 (2020). https://doi.org/10.1007/s10723-019-09504-z
DOI: https://doi.org/10.1007/s10723-019-09504-z