Abstract
Incomplete data sets arise in every field of scientific study because of random noise, data loss, limitations of data acquisition, misinterpretation of data, and similar causes. Most clustering algorithms cannot be applied to incomplete data sets directly, because objects with missing values must first be preprocessed. For this reason, this paper presents an improved mean imputation clustering algorithm for incomplete data, built on a partition clustering algorithm. In the proposed method, we divide the universe into two sets: the set of objects with no missing values and the set of objects with missing values. First, the objects with no missing values are clustered by a traditional clustering algorithm. Then, for each object with missing values, each missing attribute value is filled in turn with the mean value of that attribute in each cluster obtained from the objects with no missing values, giving one candidate imputation per cluster. Perturbation analysis of the cluster centroids is applied to search for the optimal imputation. The clustering results on several UCI data sets are evaluated with several validity indexes, and they demonstrate the effectiveness of the proposed algorithm.
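The steps described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not give the perturbation criterion, so the sketch assumes that the best candidate is the one whose insertion would shift its cluster centroid the least (adding a point x to a cluster of n objects with centroid c moves the centroid by ||x - c|| / (n + 1)). The function names, the `MISSING` marker, the farthest-first seeding, and the toy data are all hypothetical.

```python
import math

MISSING = None  # assumed marker for a missing attribute value


def mean(points):
    """Attribute-wise mean of a non-empty list of equal-length points."""
    return [sum(col) / len(points) for col in zip(*points)]


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=100):
    """Plain k-means with deterministic farthest-first seeding (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(mean(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def impute_by_cluster_means(data, k):
    """Return (filled_object, cluster_index) for every object with missing values."""
    complete = [row for row in data if MISSING not in row]
    incomplete = [row for row in data if MISSING in row]
    centroids, labels = kmeans(complete, k)
    sizes = [labels.count(j) for j in range(k)]
    filled = []
    for row in incomplete:
        best = None
        for j in range(k):
            # Candidate imputation: missing attributes take cluster j's mean value.
            candidate = [centroids[j][i] if v is MISSING else v for i, v in enumerate(row)]
            # Assumed criterion: centroid shift caused by adding the candidate to cluster j.
            shift = dist(candidate, centroids[j]) / (sizes[j] + 1)
            if best is None or shift < best[0]:
                best = (shift, candidate, j)
        filled.append((best[1], best[2]))
    return filled


# Toy data: two well-separated groups plus two objects with a missing attribute.
data = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [1.1, MISSING], [MISSING, 5.0]]
result = impute_by_cluster_means(data, k=2)
```

Each entry of `result` pairs a filled object with the index of the cluster whose means were used. Any other partition clustering routine could replace the small `kmeans` helper, and a different perturbation measure could replace the assumed centroid-shift score.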
Acknowledgements
The authors would like to thank the editor and the anonymous reviewers for their constructive and valuable comments. This work was supported in part by the National Natural Science Foundation of China (Nos. 61503160 and 61773012), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 15KJB110004), and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (No. KYCX19_1699).
Cite this article
Shi, H., Wang, P., Yang, X. et al. An Improved Mean Imputation Clustering Algorithm for Incomplete Data. Neural Process Lett 54, 3537–3550 (2022). https://doi.org/10.1007/s11063-020-10298-5
DOI: https://doi.org/10.1007/s11063-020-10298-5