Research on Parallel Adaptive Canopy-K-Means Clustering Algorithm for Big Data Mining Based on Cloud Platform

Xia, Dongliang; Ning, Feifei; He, Weina

doi:10.1007/s10723-019-09504-z

Published: 02 January 2020

Journal of Grid Computing, Volume 18, pages 263–273 (2020)

Abstract

This paper first surveys the types of clustering algorithms and describes the classical K-means and Canopy algorithms in detail. It then combines the MapReduce computing model with the Spark cloud computing framework to present a parallel Canopy-K-means algorithm, in which the Canopy algorithm is used to optimize the initial values of K-means. The Canopy algorithm, however, introduces a new distance threshold parameter, T2, which must be set from human experience and is difficult to choose by hand for large data. The paper therefore proposes a parallel adaptive Canopy-K-means algorithm that runs in a cloud computing framework and determines T2 adaptively using a statistical method. Exploiting the parallelism of the MapReduce model, the parallel Canopy-K-means algorithm is optimized by adaptive parameter estimation, which removes the dependence on manually selected parameters in the Canopy process. After presenting the relevant theory and derivation of the algorithm, the authors build a cloud computing experiment platform on the Spark framework and run comparative experiments on the Stanford Large Network Dataset Collection (SNAP) dataset and a self-built Dimension Networks dataset. The experimental results show that the proposed method is effective.
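The pipeline outlined in the abstract (estimate T2 from the data, run Canopy to obtain initial centers, then refine them with K-means) can be illustrated with a minimal serial sketch. It is not the authors' implementation: the abstract does not state the statistical estimator, so `adaptive_t2` is a hypothetical stand-in that uses a fraction of the mean pairwise distance, the Canopy step is simplified to the single threshold T2, and the MapReduce/Spark parallelization is omitted. All function names and the `scale` parameter are assumptions made for illustration.

```python
import math
from itertools import combinations


def adaptive_t2(points, scale=0.5):
    """Hypothetical T2 estimator: `scale` times the mean pairwise distance.

    This is an assumed stand-in, not the paper's estimator.
    Requires at least two points.
    """
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return scale * sum(dists) / len(dists)


def canopy_centers(points, t2):
    """Simplified Canopy pass using only the T2 threshold.

    Repeatedly take the first remaining point as a canopy center and
    drop every point within distance t2 of it from the candidate list.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return centers


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members;
        # an empty cluster keeps its previous center.
        updated = [
            tuple(sum(xs) / len(xs) for xs in zip(*members)) if members else old
            for members, old in zip(clusters, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
t2 = adaptive_t2(points)
seeds = canopy_centers(points, t2)   # number of seeds also fixes K
final_centers = kmeans(points, seeds)
print(t2, seeds, final_centers)
```

In this sketch the number of canopies determines K, so neither K nor T2 has to be supplied by hand. In a MapReduce or Spark setting, the assignment step would typically run in the map phase and the center update in the reduce phase.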


Acknowledgments

This work was supported by the Applied Research Plan of Key Scientific Research Projects in Henan Colleges and Universities (No. 18B520028) and the Technology Plan Project of Henan Science (No. 182102210471).

Author information

Authors and Affiliations

School of Software, Pingdingshan University, Pingdingshan, 467000, Henan, China
Dongliang Xia, Feifei Ning & Weina He

Corresponding author

Correspondence to Dongliang Xia.

About this article

Cite this article

Xia, D., Ning, F. & He, W. Research on Parallel Adaptive Canopy-K-Means Clustering Algorithm for Big Data Mining Based on Cloud Platform. J Grid Computing 18, 263–273 (2020). https://doi.org/10.1007/s10723-019-09504-z

Received: 29 April 2019
Accepted: 06 December 2019
Published: 02 January 2020
Issue Date: June 2020
DOI: https://doi.org/10.1007/s10723-019-09504-z
