Robust clustering of multiply censored data via mixtures of t factor analyzers

  • Original Paper
  • Published in TEST

Abstract

Mixtures of t factor analyzers (MtFA) are well recognized as a prominent tool for modeling and clustering multivariate data contaminated with heterogeneity and outliers. In certain practical situations, however, data are likely to be censored, so that the standard methodology becomes computationally complicated or even infeasible. This paper presents an extended MtFA framework that can accommodate censored data, referred to as MtFAC for short. For maximum likelihood estimation, we construct an alternating expectation conditional maximization algorithm in which the E-step relies on the first two moments of truncated multivariate-t distributions and the CM-steps yield tractable updates of the estimators. Asymptotic standard errors of the mixing proportions and component mean vectors are derived by means of the missing information principle, the so-called Louis’ method. Several numerical experiments are conducted to examine the finite-sample properties of the estimators and the ability of the proposed model to downweight the impact of censoring and outliers. Further, the efficacy and usefulness of the proposed method are demonstrated by analyzing a real dataset with genuinely censored observations.
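
For orientation, the model class named in the abstract can be sketched in standard notation. The symbols used here (\(g\) components, a \(p\)-dimensional observation \(\mathbf{y}_j\), \(p \times q\) loading matrices \(\mathbf{B}_i\), diagonal matrices \(\mathbf{D}_i\), degrees of freedom \(\nu_i\)) follow common usage for mixtures of \(t\) factor analyzers and are assumptions, not necessarily the notation of the paper itself.

\[
f(\mathbf{y}_j) = \sum_{i=1}^{g} \pi_i \, t_p\!\left(\mathbf{y}_j;\, \boldsymbol{\mu}_i,\, \boldsymbol{\Sigma}_i,\, \nu_i\right),
\qquad
\boldsymbol{\Sigma}_i = \mathbf{B}_i \mathbf{B}_i^{\top} + \mathbf{D}_i,
\qquad
\sum_{i=1}^{g} \pi_i = 1 .
\]

Under censoring, each coordinate is either observed exactly or known only to lie in some interval. Writing \(\mathbf{y}_j^{o}\) for the observed part (of dimension \(p_o\)), \(\mathbf{Y}_j^{c}\) for the censored part, and \(\mathbb{A}_j\) for its censoring region (all hypothetical labels), the contribution of component \(i\) to the likelihood would take the generic form

\[
t_{p_o}\!\left(\mathbf{y}_j^{o};\, \boldsymbol{\mu}_i^{o},\, \boldsymbol{\Sigma}_i^{oo},\, \nu_i\right)\,
\Pr\nolimits_i\!\left(\mathbf{Y}_j^{c} \in \mathbb{A}_j \,\middle|\, \mathbf{y}_j^{o}\right).
\]

Because the conditional distribution of \(\mathbf{Y}_j^{c}\) given \(\mathbf{y}_j^{o}\) is again multivariate \(t\), restricting it to \(\mathbb{A}_j\) gives a truncated multivariate-\(t\) distribution, which is why its first two moments enter the E-step.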

Acknowledgements

The authors gratefully acknowledge the editors and two anonymous referees for their insightful comments and constructive suggestions that greatly improved the quality of this paper. We are also grateful to Ms. Ting-Yu Lin for her skillful assistance in initial simulations and help sketching some graphs. This research was supported by the Ministry of Science and Technology of Taiwan under Grant Nos. 107-2628-M-035-001-MY3 and 109-2118-M-005-005-MY3.

Author information

Corresponding author

Correspondence to Tsung-I Lin.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 36 KB)

Cite this article

Wang, WL., Lin, TI. Robust clustering of multiply censored data via mixtures of t factor analyzers. TEST 31, 22–53 (2022). https://doi.org/10.1007/s11749-021-00766-y

  • DOI: https://doi.org/10.1007/s11749-021-00766-y
