Abstract
k-means clustering is a well-known procedure for classifying multivariate observations. The resulting centroid matrix of clusters by variables is noted for interpreting which variables characterize clusters. However, between-clusters differences are not always clearly captured in the centroid matrix. We address this problem by proposing a new procedure for obtaining a centroid matrix, so that it has a number of exactly zero elements. This allows easy interpretation of the matrix, as we may focus on only the nonzero centroids. The development of an iterative algorithm for the constrained minimization is described. A cardinality selection procedure for identifying the optimal cardinality is presented, as well as a modified version of the proposed procedure, in which some restrictions are imposed on the positions of nonzero elements. The behaviors of our proposed procedure were evaluated in simulation studies and are illustrated with three real data examples, which demonstrate that the performances of the procedure is promising.
Similar content being viewed by others
References
Adachi, K. (2009). Joint Procrustes analysis for simultaneous nonsingular transformation of component score and loading matrices. Psychometrika, 74, 667–683.
Adachi, K., & Trendafilov, N. T. (2015). Sparse principal component analysis subject to prespecified cardinality of loadings. Computational Statistics, 31, 1–25.
Adachi, K. (2006). Multivariate data analysis. Tokyo: Nakanishiya Shuppan. in Japanese.
Adachi, K., & Trendafilov, N. T. (2017). Sparsest factor analysis for clustering variables: a matrix decomposition approach. Advances in Data Analysis and Classification, 25, 1–29.
Aggarwal, C.C., & Reddy, C.K. (2013). Data clustering: algorithms and applications. Boca Raton: CRC Press.
Alsius, A., Wayne, R. V., Paré, M., & Munhall, K. G. (2016). High visual resolution matters in audiovisual speech perception, but only for some, Attention. Perception, & Psychophysics, 78, 1472–1487.
Basu, S., Davidson, I., & Wagstaff, K. (Eds.). (2008). Constrained clustering: advances in algorithms, theory, and applications. Boca Raton: CRC Press.
Bock, H. H. (1996). Probabilistic models in cluster analysis. Computational Statistics & Data Analysis, 23, 5–28.
Browne, M. W. (2001). An overview of analytic rotation in exploratory factor analysis. Multivariate Behavioral Research, 36, 111–150.
Brusco, M. J., & Cradit, J. D. (2001). A variable-selection heuristic for K-means clustering. Psychometrika, 66, 249–270.
Cortina, L. M., & Wasti, S. A. (2005). Profiles in coping: responses to sexual harassment across persons, organizations, and cultures. Journal of Applied Psychology, 90, 182–192.
Dalton, C., Jennings, E., O’dwyer, B., & Taylor, D. (2016). Integrating observed, inferred and simulated data to illuminate environmental change: a limnological case study. Biology and Environment: Proceedings of the Royal Irish Academy, 116, 279–294.
DeSarbo, W. S., & Mahajan, V. (1984). Constrained classification: the use of a priori information in cluster analysis. Psychometrika, 49, 187–215.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7, 179–188.
Fowlkes, E. B., & Mallows, C. L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78, 553–569.
Gordon, A.D. (1973). 359. Note: Classification in the presence of constraints. Biometrics, 29, 821–827.
Harman, H. H. (1976). Modern factor analysis, 3rd edn. Chicago: University of Chicago Press.
Hendrickson, A. E., & White, P. O. (1964). PROMAX: a quick method for rotation to oblique simple structure. British Journal of Mathematical and Statistical Psychology, 17, 65–70.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218.
Hyland, J. J., Jones, D. L., Parkhill, K. A., Barnes, A. P., & Williams, A. P. (2016). Farmers’ perceptions of climate change: identifying types. Agriculture and Human Values, 33, 323–339.
Jetti, S. K., Vendrell-Llopis, N., & Yaksi, E. (2014). Spontaneous activity governs olfactory representations in spatially organized habenular microcircuits. Current Biology, 24, 434–439.
Kaiser, H. F. (1974). An index of factorial simplicity. Psychometrika, 39, 31–36.
Kuerbis, A., Armeli, S., Muench, F., & Morgenstern, J. (2014). Profiles of confidence and commitment to change as predictors of moderated drinking: a person-centered approach. Psychology of Addictive Behaviors, 28, 1065–1076.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, vol. 1, pp. 281–297.
Miyamoto, S., Ichihashi, H., & Honda, K. (2008). Algorithms for fuzzy clustering. Berlin: Springer.
Peng, X., Zhou, C., & Hepburn, D. M. (2013). Application of K-Means method to pattern recognition in on-line cable partial discharge monitoring. IEEE Transactions on Dielectrics and Electrical Insulation, 20, 754–761.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, 846–850.
Satomura, H., & Adachi, K. (2013). Oblique rotation in canonical correlation analysis reformulated as maximizing the generalized coefficient of determination. Psychometrika, 78, 526–573.
Schloss, K. B., Hawthorne-Madell, D., & Palmer, S. E. (2015). Ecological influences on individual differences in color preference, Attention. Perception, & Psychophysics, 77, 2803–2816.
Slobodenyuk, N., Jraissati, Y., Kanso, A., Ghanem, L., & Elhajj, I. (2015). Cross-modal associations between color and haptics, Attention. Perception, & Psychophysics, 77, 1379–1395.
Steinley, D. (2006). K-means clustering: a half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59, 1–34.
Steinley, D., & Brusco, M. J. (2008). Selection of variables in cluster analysis: an empirical comparison of eight procedures. Psychometrika, 73, 125–144.
Steinley, D., Brusco, M. J., & Hubert, L. (2016). The variance of the adjusted Rand index. Psychological Methods, 21, 261–272.
Steinley, D., & Hubert, L. (2008). Order-constrained solutions in K-means clustering: even better than being globally optimal. Psychometrika, 73, 647–664.
Thurstone, L.L. (1947). Multiple-factor analysis. Chicago: University of Chicago Press.
Ullman, J. B. (2006). Structural equation modeling: reviewing the basics and moving forward. Journal of Personality Assessment, 87, 33–50.
Yamashita, N. (2012). Canonical correlation analysis formulated as maximizing sum of squared correlations and rotation of structure matrices. The Japanese Journal of Behaviormetrics, 39, 1–9. (in Japanese).
Yamashita, N., & Mayekawa, S. (2015). A new biplot procedure with joint classification of objects and variables by fuzzy c-means clustering. Advances in Data Analysis and Classification, 9, 243—266.
Acknowledgments
The authors are deeply grateful for the helpful comments and suggestions of the associate editor.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Yamashita, N., Adachi, K. A Modified k-Means Clustering Procedure for Obtaining a Cardinality-Constrained Centroid Matrix. J Classif 37, 509–525 (2020). https://doi.org/10.1007/s00357-019-09324-6
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00357-019-09324-6