Research on Parallel Adaptive Canopy-K-Means Clustering Algorithm for Big Data Mining Based on Cloud Platform

Journal of Grid Computing

Abstract

This paper first surveys the main types of clustering algorithms and describes the classical K-means and Canopy algorithms in detail. Building on the MapReduce computing model and the Spark cloud computing framework, it then presents the parallel Canopy-K-means algorithm, in which the Canopy algorithm is used to optimize the initial values of K-means. However, the Canopy algorithm introduces a new distance threshold parameter, T2, that must be set from human experience, and choosing it by hand is difficult for large data sets. This paper therefore proposes a parallel adaptive Canopy-K-means algorithm that runs in a cloud computing framework and determines the distance threshold T2 adaptively using a statistical method. The parallelism of the MapReduce model is used to optimize the parallel Canopy-K-means algorithm with adaptive parameter estimation, which removes the dependence on manually selected parameters in the Canopy process. After presenting the relevant theory and derivation of the algorithm, the paper describes a cloud computing experiment platform built on the Spark framework and comparative experiments on data from the Stanford Large Network Dataset Collection (SNAP) and a self-built Dimension Networks dataset. The experimental results show that the proposed method is effective.
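To make the seeding idea concrete, the following is a minimal, serial Python sketch of Canopy-seeded K-means. It is an illustration only, not the paper's implementation: the function names are hypothetical, the sketch does not use MapReduce or Spark, and `estimate_t2` uses a simple stand-in heuristic (half the mean pairwise distance of a sample) because the abstract does not specify the statistical estimator used for T2.

```python
import math
import random


def estimate_t2(points, sample_size=200, seed=0):
    """Assumed stand-in heuristic: T2 = half the mean pairwise distance of a sample.
    This is NOT the statistical estimator from the paper, which the abstract omits."""
    rng = random.Random(seed)
    sample = points if len(points) <= sample_size else rng.sample(points, sample_size)
    dists = [math.dist(p, q) for i, p in enumerate(sample) for q in sample[i + 1:]]
    if not dists:  # fewer than two points: no meaningful threshold
        return 0.0
    return 0.5 * sum(dists) / len(dists)


def canopy_centers(points, t2):
    """Single-threshold canopy pass: take a point as a center, discard every
    remaining point within t2 of it, and repeat until no points are left."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return centers


def kmeans(points, centers, iterations=20):
    """Plain Lloyd iterations starting from the supplied initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster; keep empty clusters fixed.
        new_centers = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (10.0, 10.0), (10.2, 9.9), (9.9, 10.1)]
t2 = estimate_t2(points)
seeds = canopy_centers(points, t2)    # canopy pass fixes both k and the initial centers
final_centers = kmeans(points, seeds)
```

In this sketch the canopy pass decides the number of clusters as well as the starting centers, so K-means needs no manually chosen k. In a MapReduce setting the canopy pass and the assignment step would each be distributed across partitions, which the sketch leaves out.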

Acknowledgments

This work was supported by the Applied Research Plan of Key Scientific Research Projects in Henan Colleges and Universities (No. 18B520028) and the Technology Plan Project of Henan Science (No. 182102210471).

Author information

Corresponding author

Correspondence to Dongliang Xia.

About this article

Cite this article

Xia, D., Ning, F. & He, W. Research on Parallel Adaptive Canopy-K-Means Clustering Algorithm for Big Data Mining Based on Cloud Platform. J Grid Computing 18, 263–273 (2020). https://doi.org/10.1007/s10723-019-09504-z

  • DOI: https://doi.org/10.1007/s10723-019-09504-z
