Association measures for interval variables

  • Regular Article
Advances in Data Analysis and Classification

Abstract

Symbolic Data Analysis (SDA) is a relatively new field of statistics that extends conventional data analysis by taking into account intrinsic data variability and structure. Unlike conventional data analysis, in SDA the features characterizing the data can be multi-valued, such as intervals or histograms. SDA has been mainly approached from a sampling perspective. In this work, we take a populational perspective and propose a model that links the micro-data and macro-data of interval-valued symbolic variables. Using this model, we derive the micro-data assumptions underlying the various definitions of symbolic covariance matrices proposed in the literature, and show that these assumptions can be too restrictive, raising applicability concerns. We analyze the various definitions using worked examples and four datasets. Our results show that the presence or absence of correlations in the macro-data may not be correctly captured by the definitions of symbolic covariance matrices and that, in real data, there can be a strong divergence between these definitions. Thus, in order to select the most appropriate definition, one must have some knowledge about the micro-data structure.
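
The sensitivity to micro-data assumptions can be illustrated with a minimal sketch. It is not the model proposed in the article. It only contrasts two simple conventions for interval data: (i) treating each interval as its centre, and (ii) assuming the micro-data are uniformly distributed within each interval and, for different variables, independent within each object. Under (ii) the variance gains the mean within-interval term (b - a)^2 / 12, while the cross-covariance still reduces to the covariance of the centres. The data and all function names below are hypothetical.

```python
from math import sqrt

# Hypothetical macro-data: three objects described by two interval variables,
# each given as (lower, upper). The numbers are made up for illustration.
X = [(0.0, 2.0), (2.0, 6.0), (5.0, 7.0)]
Y = [(1.0, 3.0), (2.0, 4.0), (6.0, 12.0)]


def centres(intervals):
    """Midpoint of each interval."""
    return [(a + b) / 2 for a, b in intervals]


def cov(u, v):
    """Population covariance (divisor n) of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / n


def var_centres(intervals):
    """Convention (i): variance of the interval centres only."""
    c = centres(intervals)
    return cov(c, c)


def var_uniform(intervals):
    """Convention (ii): micro-data assumed uniform within each interval.

    Equals the variance of the centres plus the mean within-interval
    variance (b - a)^2 / 12.
    """
    n = len(intervals)
    within = sum((b - a) ** 2 / 12 for a, b in intervals) / n
    return var_centres(intervals) + within


# Assumption: within each object the two variables are independent, so the
# cross-covariance is the covariance of the centres under both conventions.
c_xy = cov(centres(X), centres(Y))

r_centres = c_xy / sqrt(var_centres(X) * var_centres(Y))
r_uniform = c_xy / sqrt(var_uniform(X) * var_uniform(Y))

print(f"correlation, centres only:      {r_centres:.3f}")
print(f"correlation, uniform micro-data: {r_uniform:.3f}")
```

Because the second convention inflates the variances but not the cross-covariance, the resulting correlation is smaller in absolute value than the centres-only correlation. Which of the two is closer to the truth depends on how the micro-data actually behave inside the intervals.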

Acknowledgements

This research has been supported by Fundação para a Ciência e Tecnologia (FCT), Portugal, through the projects UIDB/04621/2020, UIDB/50008/2020, PTDC/EEI-TEL/32454/2017, and PTDC/EGE-ECO/30535/2017. We thank the reviewers for their constructive comments and suggestions, which greatly enriched the paper.

Author information

Corresponding author

Correspondence to M. Rosário Oliveira.

Cite this article

Oliveira, M.R., Azeitona, M., Pacheco, A. et al. Association measures for interval variables. Adv Data Anal Classif 16, 491–520 (2022). https://doi.org/10.1007/s11634-021-00445-8

  • DOI: https://doi.org/10.1007/s11634-021-00445-8
