IM-c-means: a new clustering algorithm for clusters with skewed distributions

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

In this paper, a new clustering algorithm, IM-c-means, is proposed for clusters with skewed distributions. The c-means algorithm is a well-known and widely used strategy for data clustering, but it tends to perform poorly when the data set is not uniformly distributed, a phenomenon known in the literature as the “uniform effect”. We first analyze the cause of this effect and find that it occurs only when cluster sizes vary, whereas differences in object density between clusters have no effect on the c-means algorithm. Based on this finding, we propose a new objective function that takes into account the volumes and object densities of all clusters. The resulting algorithm is effective for clusters with varied sizes or densities, while retaining the good performance of the traditional c-means algorithm on balanced data sets. Experiments on both synthetic and real data sets yield promising results for the proposed algorithm. In addition, a nonparametric test shows that the proposed algorithm can offer a significant improvement over other clustering methods on imbalanced data sets.
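The uniform effect described in the abstract can be illustrated with a small sketch. The code below is plain hard c-means (Lloyd iterations) on hypothetical 1-D data; it is not the IM-c-means algorithm proposed in the paper, whose objective function is not given on this page. The data, the function name `c_means_1d`, and the choice to initialize the centers at the true cluster means are all assumptions made for illustration. The two clusters have similar point density but very different extents, matching the abstract's claim that varying cluster size is what triggers the effect.

```python
import numpy as np

def c_means_1d(x, centers, max_iter=100):
    """Plain hard c-means (Lloyd iterations) on 1-D data.

    Illustrative only: assumes no cluster ever becomes empty.
    """
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Recompute each center as the mean of its assigned points.
        new_centers = np.array(
            [x[labels == k].mean() for k in range(len(centers))]
        )
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical data: a wide cluster on [0, 10] and a narrow one on [11, 12],
# with roughly equal point density (about 10 points per unit length).
large = np.linspace(0.0, 10.0, 101)
small = np.linspace(11.0, 12.0, 11)
x = np.concatenate([large, small])

# Start from the true cluster means, the most favorable initialization.
labels, centers = c_means_1d(x, centers=[large.mean(), small.mean()])

# Count points of the wide cluster that end up with the narrow cluster's label.
n_misassigned = int(np.sum(labels[: len(large)] == 1))
print("final centers:", centers)
print("wide-cluster points assigned to the narrow cluster:", n_misassigned)
```

Even from the ideal starting point, the decision boundary sits midway between the two centers and therefore cuts into the wide cluster, so part of it is absorbed by the narrow one. This is the behavior that motivates an objective function accounting for cluster volume and density.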

Acknowledgements

This work has been supported by the National Natural Science Foundation of China (61503151), the Natural Science Foundation of Jilin Province (20160520100JH) and the Project funded by China Postdoctoral Science Foundation (2019M651204).

Author information

Corresponding author

Correspondence to Fu Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Liu, Y., Hou, T., Miao, Y. et al. IM-c-means: a new clustering algorithm for clusters with skewed distributions. Pattern Anal Applic 24, 611–623 (2021). https://doi.org/10.1007/s10044-020-00932-2

  • DOI: https://doi.org/10.1007/s10044-020-00932-2
