Robust clustering of multiply censored data via mixtures of t factor analyzers

Wang, Wan-Lun; Lin, Tsung-I

doi:10.1007/s11749-021-00766-y

Original Paper
Published: 08 April 2021

Volume 31, pages 22–53, (2022)

Abstract

Mixtures of t factor analyzers (MtFA) are well recognized as a prominent tool for modeling and clustering multivariate data contaminated with heterogeneity and outliers. In certain practical situations, however, data are likely to be censored, so that the standard methodology becomes computationally complicated or even infeasible. This paper presents an extended framework of MtFA that can accommodate censored data, referred to as MtFAC for short. For maximum likelihood estimation, we construct an alternating expectation conditional maximization algorithm in which the E-step relies on the first two moments of truncated multivariate-t distributions and the CM-steps offer tractable solutions for the updated estimators. Asymptotic standard errors of the mixing proportions and component mean vectors are derived by means of the missing information principle, or the so-called Louis’ method. Several numerical experiments are conducted to examine the finite-sample properties of the estimators and the ability of the proposed model to downweight the impact of censoring and outlying effects. Further, the efficacy and usefulness of the proposed method are demonstrated by analyzing a real dataset with genuinely censored observations.
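
The downweighting of outliers mentioned in the abstract can be illustrated in a much simpler setting than the paper's. The sketch below is not the MtFAC algorithm: it assumes a single, fully observed (uncensored) univariate t component with known scale and degrees of freedom, and uses the standard result that the conditional expectation of the latent scale factor in a t model is (nu + p) / (nu + delta), where delta is the squared Mahalanobis distance. For censored observations the paper's E-step relies instead on moments of truncated multivariate-t distributions, which are not reproduced here. The function names are hypothetical and chosen for illustration only.

```python
def t_weight(delta: float, nu: float, p: int) -> float:
    """Expected latent scale factor for a fully observed point under a
    p-dimensional t model; delta is the squared Mahalanobis distance."""
    return (nu + p) / (nu + delta)


def weighted_mean(ys, mu, sigma2, nu):
    """One weighted-mean update for a univariate t component (p = 1),
    with the current location mu, scale sigma2 and degrees of freedom nu
    treated as known (an assumption of this sketch)."""
    weights = [t_weight((y - mu) ** 2 / sigma2, nu, 1) for y in ys]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Toy data (made up for illustration): four typical points and one outlier.
ys = [0.1, -0.2, 0.0, 0.3, 25.0]
sample_mean = sum(ys) / len(ys)
robust_mean = weighted_mean(ys, mu=0.0, sigma2=1.0, nu=4.0)

# The outlier receives a weight close to zero, so the weighted update
# stays near the bulk of the data while the plain mean is pulled away.
print(sample_mean, robust_mean)
```

Points far from the current location get weights near zero, which is why heavy-tailed t components are less sensitive to outliers than Gaussian ones. A smaller nu strengthens this effect, and as nu grows the weights tend to one, recovering the ordinary mean.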

Acknowledgements

The authors gratefully acknowledge the editors and two anonymous referees for their insightful comments and constructive suggestions that greatly improved the quality of this paper. We are also grateful to Ms. Ting-Yu Lin for her skillful assistance in initial simulations and help sketching some graphs. This research was supported by the Ministry of Science and Technology of Taiwan under Grant Nos. 107-2628-M-035-001-MY3 and 109-2118-M-005-005-MY3.

Author information

Authors and Affiliations

Department of Statistics, Graduate Institute of Statistics and Actuarial Science, Feng Chia University, Taichung, Taiwan
Wan-Lun Wang
Institute of Statistics, National Chung Hsing University, Taichung, Taiwan
Tsung-I Lin
Department of Public Health, China Medical University, Taichung, Taiwan
Tsung-I Lin

Corresponding author

Correspondence to Tsung-I Lin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 36 KB)

About this article

Cite this article

Wang, WL., Lin, TI. Robust clustering of multiply censored data via mixtures of t factor analyzers. TEST 31, 22–53 (2022). https://doi.org/10.1007/s11749-021-00766-y

Received: 19 July 2020
Accepted: 18 February 2021
Published: 08 April 2021
Issue Date: March 2022
DOI: https://doi.org/10.1007/s11749-021-00766-y
