
Comparison of Similarity Measures for Categorical Data in Hierarchical Clustering

Journal of Classification

Abstract

This paper deals with similarity measures for categorical data in hierarchical clustering that can handle variables with more than two categories and that aim to replace the simple matching approach commonly used in this area. These similarity measures take into account additional characteristics of a dataset, such as the frequency distribution of categories or the number of categories of a given variable. The paper has two main aims: first, to compare and evaluate the selected similarity measures with respect to the quality of the clusters they produce in hierarchical clustering; second, to propose new similarity measures for nominal variables. All the examined similarity measures are compared with respect to the quality of the produced clusters using the mean ranked scores of two internal evaluation coefficients. The analysis is performed on generated datasets, which makes it possible to determine the particular situations in which a given similarity measure is recommended.
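To make the contrast in the abstract concrete, the following minimal Python sketch compares simple matching with one measure that uses the number of categories of each variable. The abstract does not name the measures it examines, so the Eskin-type formula used here (a mismatch on a variable with n categories scores n²/(n² + 2) rather than 0) is an assumption taken from its common statement in the literature, and the toy dataset and all function names are hypothetical.

```python
# Illustrative sketch only: the data, names, and the choice of the
# Eskin-type measure are assumptions, not taken from the paper.

def simple_matching(x, y):
    """Share of variables on which two objects take the same category."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def eskin(x, y, n_categories):
    """Eskin-type similarity: a match scores 1; a mismatch on a variable
    with n categories scores n^2 / (n^2 + 2) instead of 0."""
    total = 0.0
    for a, b, n in zip(x, y, n_categories):
        total += 1.0 if a == b else (n * n) / (n * n + 2)
    return total / len(x)


def dissimilarity_matrix(data, sim):
    """Turn a similarity function into a full dissimilarity matrix (1 - s)."""
    return [[1.0 - sim(r, s) for s in data] for r in data]


# Four objects described by three nominal variables.
data = [
    ("red", "small", "yes"),
    ("red", "large", "yes"),
    ("blue", "large", "no"),
    ("green", "small", "no"),
]

# Number of distinct categories observed for each variable.
n_categories = [len({row[j] for row in data}) for j in range(len(data[0]))]

sm = dissimilarity_matrix(data, simple_matching)
es = dissimilarity_matrix(data, lambda r, s: eskin(r, s, n_categories))

for label, matrix in (("simple matching", sm), ("Eskin-type", es)):
    print(label)
    for row in matrix:
        print("  " + "  ".join(f"{d:.3f}" for d in row))
```

Either matrix could then be passed to a hierarchical clustering routine with a chosen linkage method. The point of the sketch is only that the two measures assign different dissimilarities to the same pair of objects once the number of categories enters the formula.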



Funding

This paper was supported by the University of Economics, Prague under the IGA project no. F4/41/2016.

Author information

Corresponding author

Correspondence to Zdeněk Šulc.


Appendix

Table 11 Rank orders of all combinations of similarity measures and linkage methods (PSFE)
Table 12 Rank orders of all combinations of similarity measures and linkage methods (PSFM)


Cite this article

Šulc, Z., Řezanková, H. Comparison of Similarity Measures for Categorical Data in Hierarchical Clustering. J Classif 36, 58–72 (2019). https://doi.org/10.1007/s00357-019-09317-5


  • DOI: https://doi.org/10.1007/s00357-019-09317-5
