Clustering multivariate data using factor analytic Bayesian mixtures with an unknown number of components

Abstract

Recent work on overfitting Bayesian mixtures of distributions offers a powerful framework for clustering multivariate data using a latent Gaussian model which resembles the factor analysis model. The flexibility provided by overfitting mixture models yields a simple and efficient way to estimate the unknown number of clusters and model parameters by Markov chain Monte Carlo sampling. The present study extends this approach by considering a set of eight parameterizations, giving rise to parsimonious representations of the covariance matrix per cluster. A Gibbs sampler combined with a prior parallel tempering scheme is implemented in order to approximately sample from the posterior distribution of the overfitting mixture. The parameterization and number of factors are selected according to the Bayesian information criterion. Identifiability issues related to label switching are dealt with by post-processing the simulated output with the Equivalence Classes Representatives algorithm. The contributed method and software are demonstrated and compared to similar models estimated using the expectation–maximization algorithm on simulated and real datasets. The software is available online at https://CRAN.R-project.org/package=fabMix.

References

  • Altekar, G., Dwarkadas, S., Huelsenbeck, J.P., Ronquist, F.: Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics 20(3), 407–415 (2004). https://doi.org/10.1093/bioinformatics/btg427

  • Bartholomew, D.J., Knott, M., Moustaki, I.: Latent Variable Models and Factor Analysis: A Unified Approach, vol. 904. Wiley, Hoboken (2011)

  • Bhattacharya, A., Dunson, D.B.: Sparse Bayesian infinite factor models. Biometrika 98(2), 291–306 (2011)

  • Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth International Group, Belmont, CA (1984)

  • Celeux, G., Hurn, M., Robert, C.P.: Computational and inferential difficulties with mixture posterior distributions. J. Am. Stat. Assoc. 95(451), 957–970 (2000a). https://doi.org/10.1080/01621459.2000.10474285

  • Celeux, G., Hurn, M., Robert, C.P.: Computational and inferential difficulties with mixture posterior distributions. J. Am. Stat. Assoc. 95(451), 957–970 (2000b)

  • Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhart, D.J., Davis, R.W.: A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 2(1), 65–73 (1998). https://doi.org/10.1016/S1097-2765(00)80114-8

  • Conti, G., Frühwirth-Schnatter, S., Heckman, J.J., Piatek, R.: Bayesian exploratory factor analysis. J. Econom. 183(1), 31–57 (2014)

  • Dellaportas, P., Papageorgiou, I.: Multivariate mixtures of normals with unknown number of components. Stat. Comput. 16(1), 57–68 (2006)

  • Dempster, A.P., Laird, N.M., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. B 39, 1–38 (1977)

  • Eddelbuettel, D., François, R.: Rcpp: seamless R and C++ integration. J. Stat. Softw. 40(8), 1–18 (2011). https://doi.org/10.18637/jss.v040.i08

  • Eddelbuettel, D., Sanderson, C.: RcppArmadillo: accelerating R with high-performance C++ linear algebra. Comput. Stat. Data Anal. 71, 1054–1063 (2014). https://doi.org/10.1016/j.csda.2013.02.005

  • Ferguson, T.S.: A Bayesian analysis of some nonparametric problems. Ann. Stat. 1(2), 209–230 (1973)

  • Fokoué, E., Titterington, D.: Mixtures of factor analysers. Bayesian estimation and inference by stochastic simulation. Mach. Learn. 50(1), 73–94 (2003)

  • Forina, M., Armanino, C., Castino, M., Ubigli, M.: Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25(3), 189–201 (1986)

  • Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002)

  • Frühwirth-Schnatter, S., Malsiner-Walli, G.: From here to infinity: sparse finite versus Dirichlet process mixtures in model based clustering. Adv. Data Anal. Classif. 13, 33–64 (2019)

  • Gaujoux, R.: doRNG: Generic Reproducible Parallel Backend for ‘foreach’ Loops. https://CRAN.R-project.org/package=doRNG, R package version 1.7.1 (2018)

  • Gelfand, A., Smith, A.: Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc. 85, 398–409 (1990)

  • Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. PAMI–6(6), 721–741 (1984). https://doi.org/10.1109/TPAMI.1984.4767596

  • Geweke, J., Zhou, G.: Measuring the pricing error of the arbitrage pricing theory. Rev. Financ. Stud. 9(2), 557–587 (1996). https://doi.org/10.1093/rfs/9.2.557

  • Geyer, C.J.: Markov chain Monte Carlo maximum likelihood. In: Proceedings of the 23rd symposium on the interface, interface foundation, Fairfax Station, Va, pp. 156–163 (1991)

  • Geyer, C.J., Thompson, E.A.: Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Am. Stat. Assoc. 90(431), 909–920 (1995). https://doi.org/10.1080/01621459.1995.10476590

  • Ghahramani, Z., Hinton, G.E.: The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto (1996)

  • Green, P.J.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711–732 (1995)

  • Hager, W.W.: Updating the inverse of a matrix. SIAM Rev. 31(2), 221–239 (1989)

  • Ihaka, R., Gentleman, R.: R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5(3), 299–314 (1996). https://doi.org/10.1080/10618600.1996.10474713

  • Kim, J.O., Mueller, C.W.: Factor Analysis: Statistical Methods and Practical Issues, vol. 14. Sage, Thousand Oaks (1978)

  • Ledermann, W.: On the rank of the reduced correlational matrix in multiple-factor analysis. Psychometrika 2(2), 85–93 (1937)

  • Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml. Accessed 15 Sept 2018

  • Malsiner Walli, G., Frühwirth-Schnatter, S., Grün, B.: Model-based clustering based on sparse finite Gaussian mixtures. Stat. Comput. 26, 303–324 (2016)

  • Malsiner Walli, G., Frühwirth-Schnatter, S., Grün, B.: Identifying mixtures of mixtures using Bayesian estimation. J. Comput. Graph. Stat. 26, 285–295 (2017)

  • Marin, J., Mengersen, K., Robert, C.: Bayesian modelling and inference on mixtures of distributions. Handb. Stat. 25(1), 577–590 (2005)

  • Mavridis, D., Ntzoufras, I.: Stochastic search item selection for factor analytic models. Br. J. Math. Stat. Psychol. 67(2), 284–303 (2014). https://doi.org/10.1111/bmsp.12019

  • McLachlan, G., Peel, D.: Finite Mixture Models. Wiley, New York (2000)

  • McNicholas, P.D., ElSherbiny, A., Jampani, R.K., McDaid, A.F., Murphy, B., Banks, L.: pgmm: Parsimonious Gaussian Mixture Models. http://CRAN.R-project.org/package=pgmm, R package version 1.2.3 (2015)

  • McNicholas, P.D.: Mixture Model-Based Classification. CRC Press, Boca Raton (2016)

  • McNicholas, P.D., Murphy, T.B.: Parsimonious Gaussian mixture models. Stat. Comput. 18(3), 285–296 (2008)

  • McNicholas, P.D., Murphy, T.B.: Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics 26(21), 2705 (2010)

  • McNicholas, P.D., Murphy, T.B., McDaid, A.F., Frost, D.: Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Comput. Stat. Data Anal. 54(3), 711–723 (2010)

  • McParland, D., Phillips, C.M., Brennan, L., Roche, H.M., Gormley, I.C.: Clustering high-dimensional mixed data to uncover sub-phenotypes: joint analysis of phenotypic and genotypic data. Stat. Med. 36(28), 4548–4569 (2017)

  • Meng, X.L., Van Dyk, D.: The EM algorithm—an old folk-song sung to a fast new tune. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 59(3), 511–567 (1997)

  • Murphy, K., Gormley, I.C., Viroli, C.: Infinite mixtures of infinite factor analysers (2019). arXiv preprint arXiv:1701.07010

  • Neal, R.M.: Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. 9(2), 249–265 (2000)

  • Nobile, A., Fearnside, A.T.: Bayesian finite mixtures with an unknown number of components: the allocation sampler. Stat. Comput. 17(2), 147–162 (2007). https://doi.org/10.1007/s11222-006-9014-7

  • Papastamoulis, P.: fabMix: Overfitting Bayesian mixtures of factor analyzers with parsimonious covariance and unknown number of components (2018a). http://CRAN.R-project.org/package=fabMix, R package version 4.5

  • Papastamoulis, P.: Handling the label switching problem in latent class models via the ECR algorithm. Commun. Stat. Simul. Comput. 43(4), 913–927 (2014)

  • Papastamoulis, P.: label.switching: an R package for dealing with the label switching problem in MCMC outputs. J. Stat. Softw. 69(1), 1–24 (2016)

  • Papastamoulis, P.: Overfitting Bayesian mixtures of factor analyzers with an unknown number of components. Comput. Stat. Data Anal. 124, 220–234 (2018b). https://doi.org/10.1016/j.csda.2018.03.007

  • Papastamoulis, P., Iliopoulos, G.: Reversible jump MCMC in mixtures of normal distributions with the same component means. Comput. Stat. Data Anal. 53(4), 900–911 (2009)

  • Papastamoulis, P., Iliopoulos, G.: An artificial allocations based solution to the label switching problem in Bayesian analysis of mixtures of distributions. J. Comput. Graph. Stat. 19, 313–331 (2010)

  • Papastamoulis, P., Iliopoulos, G.: On the convergence rate of random permutation sampler and ECR algorithm in missing data models. Methodol. Comput. Appl. Probab. 15(2), 293–304 (2013). https://doi.org/10.1007/s11009-011-9238-7

  • Papastamoulis, P., Rattray, M.: BayesBinMix: an R package for model based clustering of multivariate binary data. R J. 9(1), 403–420 (2017)

  • Plummer, M., Best, N., Cowles, K., Vines, K.: CODA: convergence diagnosis and output analysis for MCMC. R News 6(1), 7–11 (2006)

  • R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2016). https://www.R-project.org/, ISBN 3-900051-07-0

  • Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)

  • Redner, R.A., Walker, H.F.: Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 26(2), 195–239 (1984)

  • Revolution Analytics, Weston, S.: foreach: Foreach Looping Construct for R. http://CRAN.R-project.org/package=foreach, R package version 1.4.2 (2014)

  • Revolution Analytics, Weston, S.: doParallel: Foreach Parallel Adaptor for the ‘parallel’ Package. http://CRAN.R-project.org/package=doParallel, R package version 1.0.10 (2015)

  • Richardson, S., Green, P.J.: On Bayesian analysis of mixtures with an unknown number of components. J. R. Stat. Soc. Ser. B 59(4), 731–758 (1997)

  • Rousseau, J., Mengersen, K.: Asymptotic behaviour of the posterior distribution in overfitted mixture models. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 73(5), 689–710 (2011)

  • Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)

  • Scrucca, L., Fop, M., Murphy, T.B., Raftery, A.E.: mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J. 8(1), 205–233 (2017)

  • Stephens, M.: Bayesian analysis of mixture models with an unknown number of components—an alternative to reversible jump methods. Ann. Stat. 28(1), 40–74 (2000)

  • Streuli, H.: Der heutige Stand der Kaffeechemie [The current state of coffee chemistry]. In: 6th International Colloquium on Coffee Chemistry, Association Scientifique Internationale du Café, Bogotá, Colombia, pp. 61–72 (1973)

  • Tipping, M.E., Bishop, C.M.: Mixtures of probabilistic principal component analyzers. Neural Comput. 11(2), 443–482 (1999)

  • van Havre, Z., White, N., Rousseau, J., Mengersen, K.: Overfitting Bayesian mixture models with an unknown number of components. PLoS ONE 10(7), 1–27 (2015)

  • Yeung, K.Y., Fraley, C., Murua, A., Raftery, A.E., Ruzzo, W.L.: Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10), 977–987 (2001). https://doi.org/10.1093/bioinformatics/17.10.977

Acknowledgements

The author would like to acknowledge the assistance given by IT services and use of the Computational Shared Facility of the University of Manchester. The suggestions of two anonymous reviewers helped to improve the findings of this study.

Author information

Corresponding author

Correspondence to Panagiotis Papastamoulis.

Appendices

Overfitted mixture model

Assume that the observed data have been generated from a mixture model with \(K_0\) components

$$\begin{aligned} f_{K_0}({\varvec{x}}) = \sum _{k=1}^{K_0}w_kf_k({\varvec{x}}|{\varvec{\theta }}_k), \end{aligned}$$

where \(f_k\in {\mathcal {F}}_{{\varTheta }}=\{f(\cdot |{\varvec{\theta }}):{\varvec{\theta }}\in {\varTheta }\}\), \(k = 1,\ldots ,K_0\), denotes a member of a parametric family of distributions. Suppose that an overfitted mixture model \(f_{K}({\varvec{x}})\) with \(K > K_0\) components is fitted to the data. Rousseau and Mengersen (2011) showed that the asymptotic behavior of the posterior distribution of the \(K-K_0\) redundant components depends on the prior distribution of the mixing proportions \(({\varvec{w}})\). Let d denote the number of free parameters of the distribution \(f_k\). For the case of a Dirichlet prior distribution,

$$\begin{aligned} {\varvec{w}}\sim {\mathcal {D}}\left( \gamma _1,\ldots ,\gamma _K\right) \end{aligned}$$
(A.32)

if

$$\begin{aligned} \max \{\gamma _k; k=1,\ldots ,K\} < d/2 \end{aligned}$$

then the posterior weight of the extra components converges to zero (Theorem 1 of Rousseau and Mengersen 2011).

Let \(f_K({\varvec{\theta }},{\varvec{z}}|{\varvec{x}})\) denote the joint posterior distribution of model parameters and latent allocation variables for a model with K components. When using an overfitted mixture model, inference on the number of clusters reduces to (a) choosing a sufficiently large number of mixture components (K), (b) running a typical MCMC sampler for drawing samples from the posterior distribution \(f_K({\varvec{\theta }}, {\varvec{z}}|{\varvec{x}})\) and (c) inferring the number of “alive” mixture components. Note that at MCMC iteration \(t = 1,2,\ldots \), step (c) reduces to keeping track of the number of elements in the set \(\varvec{K_0}^{(t)}=\{k=1,\ldots ,K: \sum _{i=1}^{n}I(z_i^{(t)}=k)>0\}\), where \(z_i^{(t)}\) denotes the simulated allocation of observation i at iteration t.
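As a minimal illustration of step (c), the number of alive components can be tracked directly from the simulated allocations; the following R sketch assumes a matrix z with one row per MCMC iteration and one column per observation (object names are hypothetical).

# z: T x n matrix with entry z[t, i] = simulated allocation of observation i at iteration t
aliveComponents <- apply(z, 1, function(zt) length(unique(zt)))  # number of elements of K_0^(t)
table(aliveComponents)  # posterior frequencies of the number of alive components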

In our case, the number of free parameters in the k-th mixture component is equal to \(d = 2p + pq-\frac{q(q-1)}{2}\). Following Papastamoulis (2018b), we set \(\gamma _1=\cdots =\gamma _K = \frac{\gamma }{K}\); thus, the distribution of mixing proportions in Eq. (A.32) becomes

$$\begin{aligned} {\varvec{w}}\sim {\mathcal {D}}\left( \frac{\gamma }{K},\ldots ,\frac{\gamma }{K}\right) \end{aligned}$$
(A.33)

where \(0< \gamma < d/2\) denotes a pre-specified positive number. Such a value is chosen for two reasons. First, since it is smaller than \(d/2\), the asymptotic results of Rousseau and Mengersen (2011) ensure that the extra components will be emptied as \(n\rightarrow \infty \). Second, this choice can be related to standard practice when using Bayesian nonparametric clustering methods where the parameters of a mixture are drawn from a Dirichlet process (Ferguson 1973), that is, a Dirichlet process mixture model (Neal 2000).
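As a worked example (using the dimensions of the simulation scenarios considered later, \(p = 30\) and \(q = 2\)), the bound on \(\gamma \) can be computed in R as follows.

p <- 30; q <- 2
d <- 2 * p + p * q - q * (q - 1) / 2  # number of free parameters per component: 119
d / 2                                 # upper bound for gamma: 59.5
gamma <- 1                            # default value (Table 5); well below d / 2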

Details of the MCMC sampler

Data normalization and prior parameters. Before running the sampler, the raw data are standardized by applying the z-transformation

$$\begin{aligned} \frac{x_{ir} - {\bar{x}}_{r}}{\sqrt{s^2_r}}, \quad i =1,\ldots ,n;r = 1,\ldots ,p \end{aligned}$$

where \({\bar{x}}_{r} = \frac{\sum _{i=1}^{n}x_{ir}}{n}\) and \(s^{2}_{r}= \frac{1}{n-1}\sum _{i=1}^{n}\left( x_{ir}-{\bar{x}}_{r}\right) ^2\). The main reason for using standardized data is that the sampler mixes better. Furthermore, it becomes easier to choose prior parameters that do not depend on the observed data, thereby avoiding using the data twice. Otherwise, one could use empirical prior distributions as reported in Fokoué and Titterington (2003); see also Dellaportas and Papageorgiou (2006). For the case of standardized data, the prior parameters are specified in Table 5. Standardized data are also used as input to pgmm.
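In R, this standardization is a column-wise z-transformation; a minimal sketch, assuming the raw data are stored in a numeric matrix x with n rows and p columns (a hypothetical name):

# subtract column means and divide by column standard deviations (denominator n - 1), as in the formula above
xStandardized <- scale(x, center = TRUE, scale = TRUE)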

Table 5 Prior parameter specification for the case of standardized data

Prior parallel tempering. It is well known that the posterior surface of mixture models can exhibit many local modes (Celeux et al. 2000b; Marin et al. 2005). In such cases, simple MCMC algorithms may become trapped in minor modes and demand a very large number of iterations to sufficiently explore the posterior distribution. In order to produce a well-mixing MCMC sample and improve the convergence of our algorithm, we utilize ideas from parallel tempering schemes (Geyer 1991; Geyer and Thompson 1995; Altekar et al. 2004), where different chains run in parallel and are allowed to switch states. Each chain corresponds to a different posterior distribution, and usually each one represents a “heated” version of the target posterior distribution. This is achieved by raising the original target to a power T with \(0\leqslant T \leqslant 1\), which flattens the posterior surface and thus makes it easier to explore with an MCMC sampler.

In the context of overfitting mixture models, van Havre et al. (2015) introduced a prior parallel tempering scheme, which is also applied by Papastamoulis (2018b). Under this approach, each heated chain corresponds to a model with the same likelihood as the original, but with a different prior distribution. Although the prior tempering can be imposed on any subset of parameters, it is only applied to the Dirichlet prior distribution of mixing proportions (van Havre et al. 2015). Let us denote by \(f_i({\varvec{\varphi }}|{\varvec{x}})\) and \(f_i({\varvec{\varphi }})\), \(i=1,\ldots ,J\), the posterior and prior distribution of the i-th chain, respectively. Obviously, \(f_i({\varvec{\varphi }}|{\varvec{x}}) \propto f({\varvec{x}}|{\varvec{\varphi }})f_i({\varvec{\varphi }})\). Let \({\varvec{\varphi }}^{(t)}_i\) denote the state of chain i at iteration t and assume that a swap between chains i and j is proposed. The proposed move is accepted with probability \(\min \{1,A\}\), where

$$\begin{aligned} A = \frac{f_i({\varvec{\varphi }}_j^{(t)}|{\varvec{x}})\,f_j({\varvec{\varphi }}_i^{(t)}|{\varvec{x}})}{f_i({\varvec{\varphi }}_i^{(t)}|{\varvec{x}})\,f_j({\varvec{\varphi }}_j^{(t)}|{\varvec{x}})} = \frac{f_i({\varvec{\varphi }}_j^{(t)})\,f_j({\varvec{\varphi }}_i^{(t)})}{f_i({\varvec{\varphi }}_i^{(t)})\,f_j({\varvec{\varphi }}_j^{(t)})} = \frac{{\widetilde{f}}_i(w_j^{(t)})\,{\widetilde{f}}_j(w_i^{(t)})}{{\widetilde{f}}_i(w_i^{(t)})\,{\widetilde{f}}_j(w_j^{(t)})}, \end{aligned}$$
(B.34)

and \({\widetilde{f}}_i(\cdot )\) corresponds to the probability density function of the Dirichlet prior distribution related to chain \(i = 1,\ldots ,J\). According to Eq. (A.33), this is

$$\begin{aligned} {\varvec{w}}\sim {\mathcal {D}}\left( \frac{\gamma _{(j)}}{K},\ldots , \frac{\gamma _{(j)}}{K}\right) , \end{aligned}$$
(B.35)

for a pre-specified set of parameters \(\gamma _{(j)}>0\) for \(j = 1,\ldots ,J\).
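Because the tempering acts only on the Dirichlet prior of the mixing proportions, the acceptance probability in Eq. (B.34) involves Dirichlet densities alone. A minimal R sketch, assuming w_i and w_j hold the current weight vectors of chains i and j and alpha_i, alpha_j their Dirichlet hyperparameter vectors (all object names hypothetical):

logDirichlet <- function(w, alpha) {
  # log-density of the Dirichlet distribution with parameter vector alpha, evaluated at w
  lgamma(sum(alpha)) - sum(lgamma(alpha)) + sum((alpha - 1) * log(w))
}
logA <- logDirichlet(w_j, alpha_i) + logDirichlet(w_i, alpha_j) -
  logDirichlet(w_i, alpha_i) - logDirichlet(w_j, alpha_j)
acceptSwap <- log(runif(1)) < min(0, logA)  # accept the swap with probability min{1, A}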

In our examples, we used a total of \(J = 4\) parallel chains where the prior distribution of mixing proportions for chain j in Eq. (B.35) is selected as

$$\begin{aligned} \gamma _{(j)} = \gamma + \delta (j-1), \quad j = 1,\ldots ,J, \end{aligned}$$

where \(\delta > 0\). For example, when the overfitting mixture model uses \(K=20\) components and \(\gamma = 1\) (the default value shown in Table 5), it follows from Eq. (A.33) that the parameter vector of the Dirichlet prior of mixture weights which corresponds to the target posterior distribution (\(j = 1\)) is equal to \((0.05,\ldots ,0.05)\). In our examples we have used \(\delta =1\), but in general we strongly suggest tuning this parameter until a reasonable acceptance rate is achieved. Each chain runs in parallel, and every 10 iterations we randomly select two adjacent chains \((j,j+1)\), \(j\in \{1,\ldots ,J-1\}\), and propose to swap their current states. A proposed swap is accepted with probability \(\min \{1,A\}\), with A given in Eq. (B.34).
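For the worked example above (\(K = 20\), \(\gamma = 1\), \(\delta = 1\), \(J = 4\)), the ladder of Dirichlet hyperparameters per chain is easily reproduced in R:

K <- 20; gamma <- 1; delta <- 1; J <- 4
gammaLadder <- gamma + delta * (0:(J - 1))  # 1 2 3 4
gammaLadder / K                             # per-component Dirichlet parameter per chain: 0.05 0.10 0.15 0.20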

“Overfitting initialization” strategy. We briefly describe the “overfitting initialization” procedure introduced by Papastamoulis (2018b). We used an initial period of 500 MCMC iterations where each chain is initialized from totally random starting values, but under a Dirichlet prior distribution with large prior parameter values. These values were chosen so that the asymptotic results of Rousseau and Mengersen (2011) guarantee that the redundant mixture components will have non-negligible posterior weights. More specifically, for chain j we assume \({\varvec{w}}\sim {\mathcal {D}}(\gamma '_{(j)},\ldots ,\gamma '_{(j)})\) with \(\gamma '_{(j)} = \frac{d}{2} + (j-1)\frac{d}{2(J-1)}\), for \(j =1,\ldots ,J\). Then, the actual model is initialized from this state. According to Papastamoulis (2018b), this specific scheme was found to outperform other initialization procedures.
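For illustration, with \(d = 119\) (that is, \(p = 30\) and \(q = 2\)) and \(J = 4\) chains, the initialization hyperparameters form an equally spaced grid between \(d/2\) and \(d\):

d <- 119; J <- 4
gammaInit <- d / 2 + (0:(J - 1)) * d / (2 * (J - 1))  # approximately 59.5, 79.3, 99.2, 119.0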

Table 6 Setup for our simulation scenarios with unequal cluster sizes. In all cases, the dimensionality of the multivariate data is equal to \(p=30\)
Fig. 8 Examples of simulated datasets with unequal cluster sizes according to scenarios 1 and 2 and \(n = 200\). The legend shows the “true” cluster sizes

Additional simulations

In the simulation section of the manuscript, the weights of the simulated datasets were randomly generated from a Dirichlet distribution with mean equal to 1/K, conditional on the number of clusters (K). Thus, on average, the true cluster sizes are equal. In this section, we examine the performance of the proposed method in the presence of unequal cluster sizes with respect to the sample size (n) of the observed data.

We replicate the simulation mechanism for scenarios 1 and 2 presented in the main text, but now we consider unequal (true) cluster sizes, as detailed in Table 6. For each case, the sample size increases (as shown in the last column of Table 6) while keeping all other parameters (that is, the true values of marginal means and factor loadings) constant. As shown in Table 6, in scenario 1 there are five clusters and two factors, whereas in scenario 2 there are two clusters and three factors. In total, three different examples per scenario are considered: for a given scenario, the component-specific parameters are different in each example, but the weights are the same. An instance of our three examples (per scenario) using \(n = 200\) simulated observations is shown in Fig. 8. Observe that in all cases the “true clusters” are not easily distinguishable, especially in scenario 2 where there is a high degree of cluster overlap.
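The generative mechanism can be sketched in R as a mixture of factor analyzers with fixed unequal weights; the parameter values below are illustrative placeholders, not the ones used in Table 6.

set.seed(1)
n <- 200; p <- 30; q <- 2; K <- 5
w <- c(0.40, 0.25, 0.15, 0.12, 0.08)           # unequal "true" cluster weights (illustrative)
z <- sample(1:K, n, replace = TRUE, prob = w)  # latent cluster allocations
mu <- matrix(rnorm(K * p, sd = 3), K, p)       # cluster-specific marginal means
Lambda <- array(rnorm(K * p * q), c(K, p, q))  # cluster-specific factor loadings
x <- t(sapply(1:n, function(i) {
  k <- z[i]
  y <- rnorm(q)                                # latent factor scores
  mu[k, ] + as.vector(Lambda[k, , ] %*% y) + rnorm(p, sd = 0.5)
}))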

Fig. 9 Adjusted Rand index (first row), estimated number of clusters (second row) and estimated number of factors (third row) for simulated data according to scenario 1 with unequal cluster sizes and increasing sample size. The dotted line in the first row corresponds to the adjusted Rand index between the ground truth and the clustering of the data when applying the Maximum A Posteriori rule using the parameter values that generated the data (C.36). For all examples, the true number of clusters and factors is equal to 5 and 2, respectively

Fig. 10 Adjusted Rand index (first row), estimated number of clusters (second row) and estimated number of factors (third row) for simulated data according to scenario 2 with unequal cluster sizes and increasing sample size. The dotted line in the first row corresponds to the adjusted Rand index between the ground truth and the clustering of the data when applying the Maximum A Posteriori rule using the parameter values that generated the data (C.36). For all examples, the true number of clusters and factors is equal to 2 and 3, respectively

We applied fabMix and pgmm using the same number (4) of parallel chains (for fabMix) and different starts (for pgmm) as in the simulations presented in the main paper. The results are summarized in Figs. 9 and 10 for scenarios 1 and 2, respectively. The adjusted Rand index is displayed in the first row of each figure, where the horizontal axis denotes the sample size (n) of the synthetic data. The dotted black line corresponds to the adjusted Rand index between the ground truth and the cluster assignments arising when applying the Maximum A Posteriori rule using the true parameter values that generated the data, that is,

$$\begin{aligned} z_i=\underset{k\in \{1,\ldots ,K^*\}}{\arg \max }\left\{ \frac{w_k^*f_k({\varvec{x}}_i|\theta _k^*)}{\sum _{j=1}^{K^*}w_j^*f_j({\varvec{x}}_i|\theta _j^*)}\right\} ,\quad i=1,\ldots ,n, \end{aligned}$$
(C.36)

where \(K^*\), \((w_1^*,\ldots ,w^*_{K^*})\) and \((\theta _1^*,\ldots ,\theta ^*_{K^*})\) denote the number of components, mixing proportions and parameters of the multivariate normal densities of the mixture model used to generate the data. Observe that in all three examples of scenario 1 the dotted black line is always equal to 1, but this is not the case in the more challenging scenario 2 due to the higher degree of cluster overlap.
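A minimal R sketch of this oracle MAP classification, assuming the generating parameters are stored as a weight vector wTrue, a list of mean vectors muTrue and a list of component covariance matrices SigmaTrue (hypothetical names), and using the dmvnorm density from the mvtnorm package:

library(mvtnorm)
mapClassify <- function(x, wTrue, muTrue, SigmaTrue) {
  # x: n x p data matrix; SigmaTrue[[k]] is the marginal covariance of component k
  # (Lambda_k Lambda_k' + Psi_k under the factor analysis parameterization)
  dens <- sapply(seq_along(wTrue), function(k)
    wTrue[k] * dmvnorm(x, mean = muTrue[[k]], sigma = SigmaTrue[[k]]))
  apply(dens, 1, which.max)  # MAP allocation per observation
}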

The adjusted Rand index between the ground-truth clustering and the estimated cluster assignments arising from fabMix and pgmm is shown in the first row of Figs. 9 and 10. Clearly, the compared methods have similar performance as the sample size increases, but for smaller values of n the proposed method outperforms the pgmm package.
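The adjusted Rand index itself can be computed, for example, with the adjustedRandIndex function of the mclust package (Scrucca et al. 2017); zEstimated and zTrue below are hypothetical label vectors:

library(mclust)
ari <- adjustedRandIndex(zEstimated, zTrue)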

The estimated number of clusters, shown in the second row of Figs. 9 and 10, agrees (in most cases) with the true number of clusters, but note that our method detects the correct value at smaller sample sizes than pgmm. Two exceptions occur at \(n=200\) for example 2 of scenario 1, where fabMix (red line in the second row of Fig. 9) inferred 6 alive clusters instead of 5, and at \(n=800\) for example 1 of scenario 2, where fabMix (red line in the second row of Fig. 10) inferred 3 alive clusters instead of 2.

Finally, the last row in Figs. 9 and 10 displays the inferred number of factors for scenarios 1 and 2, respectively. In every single case, the estimate arising from fabMix is at least as close to the corresponding true value as the estimate arising from pgmm. Note, however, that in example 1 of scenario 2 both methods detect a smaller number of factors (2 instead of 3). In all other cases, we observe that as the sample size increases both methods infer the “true” number of factors.

Cite this article

Papastamoulis, P. Clustering multivariate data using factor analytic Bayesian mixtures with an unknown number of components. Stat Comput 30, 485–506 (2020). https://doi.org/10.1007/s11222-019-09891-z
