Abstract
This work presents a mixture model for clustering variables of different types. All variables being measured on the same n statistical units, we first represent each variable by a unit-norm operator in \({\mathbb {R}}^{n\times n}\) endowed with an appropriate inner product. We then propose a von Mises–Fisher mixture model on the unit sphere containing these operators. The parameters of the mixture model are estimated with an EM algorithm, with a K-means procedure providing a good starting point. The method is tested on simulated data and finally applied to wine data.
Appendix: proofs
a) \(\Phi ^2(X,Y)=\textit{tr}(\Pi _X \Pi _Y)\).
Proof
Let X and Y be two categorical variables with q and r levels respectively, and let \(\mathbf{X }\) and \(\mathbf{Y }\) denote their respective matrices of q (resp. r) uncentred indicator variables, and X and Y their respective matrices of \(q-1\) (resp. \(r-1\)) centred indicator variables. We have:
And so:
Besides, it can easily be shown that:
where
We have:
From (5), (6) and (7), we get \(\Phi ^2(X,Y)=\textit{tr}(\Pi _X \Pi _Y)\). \(\square \)
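Result a) can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it takes uniform weights \(W = I/n\), so that the projector onto the uncentred indicators of a variable simply averages within each level and centring removes the constant direction. The function names and the toy data are hypothetical and do not come from the article.

```python
from collections import Counter
from itertools import product


def centred_projector(labels):
    """Projector onto the span of the centred indicators of `labels`.

    Assumes uniform weights: entry (i, k) equals 1/n_level - 1/n when
    units i and k share a level, and -1/n otherwise.
    """
    n = len(labels)
    counts = Counter(labels)
    return [
        [(1.0 / counts[a] if a == b else 0.0) - 1.0 / n for b in labels]
        for a in labels
    ]


def trace_of_product(p, q):
    """tr(PQ) computed as the sum over i, k of P[i][k] * Q[k][i]."""
    n = len(p)
    return sum(p[i][k] * q[k][i] for i in range(n) for k in range(n))


def phi_squared(x, y):
    """Pearson's Phi^2 = chi^2 / n from the contingency table of x and y."""
    n = len(x)
    row, col, cell = Counter(x), Counter(y), Counter(zip(x, y))
    chi2 = 0.0
    for a, b in product(row, col):
        expected = row[a] * col[b] / n
        chi2 += (cell[(a, b)] - expected) ** 2 / expected
    return chi2 / n


# Toy categorical variables measured on the same 8 units (invented data).
x = ["a", "a", "b", "b", "c", "c", "a", "b"]
y = ["u", "v", "u", "v", "u", "u", "u", "v"]

lhs = phi_squared(x, y)
rhs = trace_of_product(centred_projector(x), centred_projector(y))
print(lhs, rhs)  # the two values should agree up to rounding error
```

Building the projectors entry by entry keeps the example free of matrix inversion; with non-uniform weights W the general form \(X(X'WX)^{-1}X'W\) would be needed instead.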
b) \(\forall x \ne 0: \underset{y\in S}{\text {arg min }}\Vert x-y\Vert ^2=\frac{x}{\Vert x\Vert }\), where S is the unit sphere.
Proof
For every \(y \ne 0\), let \(y^0=\frac{y}{\Vert y\Vert }\). Then
So:
Taking for instance \({\hat{y}}=x \) gives the result. \(\square \)
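The two intermediate steps of this proof can be sketched as follows. The notation \(\langle \cdot | \cdot \rangle\) for the inner product inducing \(\Vert \cdot \Vert \) is an assumption made here for illustration; the argument itself is only the expansion of the squared distance followed by the Cauchy-Schwarz inequality.

```latex
\begin{align*}
\Vert x - y^0 \Vert^2
  &= \Vert x \Vert^2 - 2\,\langle x | y^0 \rangle + \Vert y^0 \Vert^2
   = \Vert x \Vert^2 + 1 - 2\,\frac{\langle x | y \rangle}{\Vert y \Vert}, \\
\underset{y^0 \in S}{\arg\min}\ \Vert x - y^0 \Vert^2
  &= \hat{y}^0
  \quad\text{with}\quad
  \hat{y} \in \underset{y \neq 0}{\arg\max}\ \frac{\langle x | y \rangle}{\Vert y \Vert}, \\
\frac{\langle x | y \rangle}{\Vert y \Vert}
  &\le \Vert x \Vert
  \quad\text{with equality iff } y = \lambda x,\ \lambda > 0 .
\end{align*}
```

Any positive multiple of x therefore attains the maximum, and the choice \({\hat{y}}=x\) yields \({\hat{y}}^0 = \frac{x}{\Vert x\Vert }\).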
c) Rank-r average of normed projectors.
Let \(\Pi _U\) be the projector onto the space spanned by r W-orthonormal vectors \(u_1 , \ldots , u_r \in {\mathbb {R}}^n\). Let U denote the matrix \([u_1 , \ldots , u_r]\). The rank-r average of a set of p normed projectors \({\tilde{O}}_j\) is defined as the normed projector \(\tilde{{\bar{O}}}^r = \frac{\Pi _U}{\sqrt{r}}\) which satisfies:
Now, \(\forall j, {\tilde{O}}_j = X_j M_j X_j'W \), where:
- \(X_j\) is the variable associated with the projector, coded as described in Sect. 2.1, and
- \(M_j = \frac{(X_j'WX_j)^{-1}}{\sqrt{\dim (X_j)}}\).
We denote \( X = [X_1 , \ldots , X_p]\) and \(M = diag(M_j ; j=1,\ldots ,p) \) (block-diagonal matrix with blocks \(M_j\)).
Since \( \forall j, \llbracket \tilde{O}_U - {\tilde{O}}_j \rrbracket ^2 = 2(1 - [\tilde{O}_U | {\tilde{O}}_j ] ) \), we have:
The vectors \(u_1 , \ldots , u_r \) being W-orthonormal, this maximization program is exactly that of the (dual) PCA of array X with metric M and weights W. The solution vectors \(u_1 , \ldots , u_r\) are hence the first r principal components (PCs) of (X, M, W), and \(\tilde{{\bar{O}}}^r\) is the normed projector onto the space they span. \(\square \)
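The chain of equalities behind this argument can be sketched as follows. It relies on two assumptions about notation: that the inner product on operators is \([A | B] = \textit{tr}(AB)\), as result a) suggests, and that \(\tilde{O}_U = \frac{\Pi _U}{\sqrt{r}}\) with \(\Pi _U = UU'W\), which holds when \(U'WU = I_r\). The constant factor \(\frac{1}{\sqrt{r}}\) is dropped in the last two lines because it does not change the maximizer.

```latex
\begin{align*}
\underset{U:\,U'WU=I_r}{\arg\min}\ \sum_{j=1}^{p} \llbracket \tilde{O}_U - \tilde{O}_j \rrbracket^2
 &= \underset{U:\,U'WU=I_r}{\arg\max}\ \sum_{j=1}^{p} [\tilde{O}_U \,|\, \tilde{O}_j] \\
 &= \underset{U:\,U'WU=I_r}{\arg\max}\ \frac{1}{\sqrt{r}} \sum_{j=1}^{p}
    \text{tr}\left( \Pi_U X_j M_j X_j' W \right) \\
 &= \underset{U:\,U'WU=I_r}{\arg\max}\ \text{tr}\left( U U' W X M X' W \right) \\
 &= \underset{U:\,U'WU=I_r}{\arg\max}\ \text{tr}\left( U' W X M X' W U \right).
\end{align*}
```

The third line uses \(\sum _j X_j M_j X_j' = XMX'\) for the block-diagonal M, and the fourth uses the cyclic property of the trace. The last program is maximized by the r leading W-orthonormal eigenvectors of \(XMX'W\), which are the principal components referred to above.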
Cite this article
Bry, X., Cucala, L. A von Mises–Fisher mixture model for clustering numerical and categorical variables. Adv Data Anal Classif 16, 429–455 (2022). https://doi.org/10.1007/s11634-021-00449-4
DOI: https://doi.org/10.1007/s11634-021-00449-4