Abstract
Symbolic Data Analysis (SDA) is a relatively new field of statistics that extends conventional data analysis by taking into account intrinsic data variability and structure. Unlike in conventional data analysis, the features characterizing the data in SDA can be multi-valued, such as intervals or histograms. SDA has mainly been approached from a sampling perspective. In this work, we take a populational perspective and propose a model that links the micro-data and macro-data of interval-valued symbolic variables. Using this model, we derive the micro-data assumptions underlying the various definitions of symbolic covariance matrices proposed in the literature, and show that these assumptions can be too restrictive, raising applicability concerns. We analyze the various definitions using worked examples and four datasets. Our results show that the presence or absence of correlations in the macro-data may not be correctly captured by the definitions of symbolic covariance matrices and that, in real data, these definitions can diverge strongly. Thus, to select the most appropriate definition, one must have some knowledge of the micro-data structure.
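As a minimal illustration of why the micro-data structure matters, the following Python sketch builds a small, entirely hypothetical dataset in which each object aggregates three micro-observations into intervals. It compares a simple covariance computed from interval centers with the covariance of the pooled micro-data. The center-based covariance is used here only as an illustrative summary and is not claimed to coincide with any of the specific definitions analyzed in the paper; the data, the variable names, and the helper `cov12` are assumptions made for this example.

```python
import numpy as np

# Hypothetical micro-data: 3 objects (g = 0, 1, 2), each with 3 micro-observations.
# Object locations move in opposite directions on the two variables,
# while the points inside each object move together.
groups = np.array([0.0, 1.0, 2.0])
offsets = np.array([-3.0, 0.0, 3.0])
micro = [np.column_stack((g + offsets, -g + offsets)) for g in groups]

# Macro-data: one interval [min, max] per object and per variable.
lower = np.array([m.min(axis=0) for m in micro])
upper = np.array([m.max(axis=0) for m in micro])
centers = (lower + upper) / 2.0


def cov12(points):
    """Population covariance between the two columns of `points`."""
    return float(np.cov(points, rowvar=False, bias=True)[0, 1])


# Covariance seen through the interval centers only.
center_cov = cov12(centers)
# Covariance of all micro-observations pooled together.
micro_cov = cov12(np.vstack(micro))

print("center-based covariance:", center_cov)
print("pooled micro-data covariance:", micro_cov)
```

In this constructed case the two quantities have opposite signs: the centers are negatively associated, while the pooled micro-data are positively associated because of the within-object structure that the centers discard.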
Acknowledgements
This research has been supported by Fundação para a Ciência e Tecnologia (FCT), Portugal, through the projects UIDB/04621/2020, UIDB/50008/2020, PTDC/EEI-TEL/32454/2017, and PTDC/EGE-ECO/30535/2017. We thank the reviewers for their constructive comments and suggestions, which greatly enriched the paper.
About this article
Cite this article
Oliveira, M.R., Azeitona, M., Pacheco, A. et al. Association measures for interval variables. Adv Data Anal Classif 16, 491–520 (2022). https://doi.org/10.1007/s11634-021-00445-8
DOI: https://doi.org/10.1007/s11634-021-00445-8