Abstract
Symbolic data are aggregated from larger traditional datasets, both to hide entry-specific details and to make it feasible to analyse volumes of data, such as big data, that would otherwise be intractable. Symbolic data can take many different and complex forms, such as intervals and histograms. Identifying patterns and finding similarities between objects is one of the most fundamental tasks of data mining. Conventional methods are not sufficient to cluster these sophisticated data types accurately. Over the years different approaches have been proposed, but they mainly concentrate on the “macroscopic” similarities between objects. Distributional data, for example symbolic data, have been aggregated from large datasets, and thus even the smallest microscopic differences and similarities become extremely important. In this paper a method is proposed for clustering distributional data based on these microscopic similarities by using quantile values. Having multiple points of comparison makes it possible to identify similarities in small sections of a distribution while producing more adequate hierarchical concepts. The proposed algorithm, called microscopic hierarchical conceptual clustering, has a monotone property and was found in experiments to produce more adequate conceptual clusters. Furthermore, because it relies on quantiles, the algorithm can compare different types of symbolic data easily, without any additional complexity.
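To make the idea of a common quantile representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that values are uniformly distributed within an interval and within each histogram bin, and that all bin weights are positive. The function names `interval_quantiles` and `histogram_quantiles`, and the choice of m = 4 (quartiles), are illustrative assumptions.

```python
def interval_quantiles(lo, hi, m=4):
    """Return m + 1 quantile values of the interval [lo, hi].

    Assumption: values are uniformly distributed over the interval.
    """
    return [lo + (hi - lo) * k / m for k in range(m + 1)]


def histogram_quantiles(edges, weights, m=4):
    """Return m + 1 quantile values of a histogram.

    Assumptions: `edges` has one more entry than `weights`, every weight
    is positive, and values are uniformly distributed inside each bin.
    """
    total = sum(weights)
    # Cumulative distribution evaluated at each bin edge.
    cdf = [0.0]
    for w in weights:
        cdf.append(cdf[-1] + w / total)

    quantiles = []
    last = len(weights) - 1
    for k in range(m + 1):
        p = k / m
        for i in range(len(weights)):
            # Use the first bin whose upper cumulative value reaches p.
            if p <= cdf[i + 1] or i == last:
                frac = (p - cdf[i]) / (cdf[i + 1] - cdf[i])
                frac = min(1.0, frac)  # guard against rounding at p = 1
                quantiles.append(edges[i] + frac * (edges[i + 1] - edges[i]))
                break
    return quantiles


# An interval-valued object and a histogram-valued object, both described
# by the same five quantile values (minimum, quartiles, maximum).
q_interval = interval_quantiles(0, 8)
q_histogram = histogram_quantiles([0, 2, 4, 8], [0.5, 0.25, 0.25])

# Point-by-point differences show where the two distributions diverge.
differences = [abs(a - b) for a, b in zip(q_interval, q_histogram)]
print(q_interval)
print(q_histogram)
print(differences)
```

Both objects share the same range, so a comparison based only on minimum and maximum would treat them as identical. The per-quantile differences are non-zero at the interior points, which is the kind of small-scale detail the abstract calls microscopic.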
Acknowledgements
The authors thank the reviewers for their helpful comments. Kadri Umbleja’s work was supported by the Japan Society for the Promotion of Science’s International Research Fellow program.
Appendix
A Python implementation of the algorithm is available at: https://github.com/iardacil/MHCC
Cite this article
Umbleja, K., Ichino, M. & Yaguchi, H. Hierarchical conceptual clustering based on quantile method for identifying microscopic details in distributional data. Adv Data Anal Classif 15, 407–436 (2021). https://doi.org/10.1007/s11634-020-00411-w
DOI: https://doi.org/10.1007/s11634-020-00411-w