1 Introduction

Finite mixture models have been linked with clustering since the idea of defining a cluster in terms of a component in a finite mixture model was put forth more than 60 years ago (see McNicholas 2016a, Section 2.1). Nowadays, mixture model–based clustering is a popular approach to clustering. A random vector X arises from a finite mixture model if, for all x in its support, its density can be written

$$f(\mathbf{x}\mid\boldsymbol{\vartheta})= \sum\limits_{g = 1}^{G} \pi_{g} f_{g}(\mathbf{x}\mid\boldsymbol{\theta}_{g}),$$

where πg > 0 such that \({\sum }_{g = 1}^{G} \pi _{g} = 1\) are the mixing proportions, fg(x ∣ θg) is the g th component density, and 𝜗 = (π, θ1, … , θG) denotes the vector of parameters with π = (π1, … , πG). The component densities f1(x ∣ θ1), … , fG(x ∣ θG) are typically taken to be of the same type, most commonly multivariate Gaussian. In fact, until a few years after the turn of the century, almost all work on clustering and classification using mixture models had been based on Gaussian mixture models (e.g., Banfield and Raftery 1993; Celeux and Govaert 1995; Ghahramani and Hinton 1997; Tipping and Bishop 1999; McLachlan and Peel 2000; Fraley and Raftery 2002).

Early work on non-Gaussian mixtures was on mixtures of multivariate t-distributions (e.g., Peel and McLachlan 2000). A little beyond the turn of the century, work on t-mixtures burgeoned into a substantial subfield of mixture model-based classification (e.g., McLachlan et al. 2007; Andrews and McNicholas 2011a, b, 2012; Baek and McLachlan 2011; Steane et al. 2012; Lin et al. 2014; Pesevski et al. 2018). Around the same time, work on mixtures of skewed distributions took off, including work on skew-normal mixtures (e.g., Lin 2009), skew-t mixtures (e.g., Lin 2010; Vrbik and McNicholas 2012, 2014; Lee and McLachlan 2013a, b; Murray et al. 2014), Laplace mixtures (e.g., Franczak et al. 2014), variance-gamma mixtures (McNicholas et al. 2017), generalized hyperbolic mixtures (Browne and McNicholas 2015), and other non-elliptically contoured distributions (e.g., Karlis and Santourian 2009; Murray et al. 2017; Tang et al. 2018). A thorough review of work on model-based clustering is given by McNicholas (2016b).

More recently, mixtures of multiple scaled distributions have been considered (Section 2). In the present manuscript, a multiple scaled generalized hyperbolic distribution is introduced (Section 3). Because the (multivariate) generalized hyperbolic distribution is not a special case of the multiple scaled generalized hyperbolic distribution, a coalesced generalized hyperbolic distribution is also developed (Section 4). The issue of cluster convexity, which has essentially been ignored in work to date on multiple scaled distributions, is discussed, and a special case of the multiple scaled generalized hyperbolic distribution is developed to guarantee that the components, i.e., the clusters, are convex (Section 5).

2 Multiple Scaled Distributions

The distribution of a p-dimensional random variable X is said to be a normal variance-mean mixture if its density can be written in the form

$$ f(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\alpha},\boldsymbol{\theta})= {\int}_{0}^{\infty}{\phi_{p}\left( \mathbf{x}\mid\boldsymbol{\mu}+w\boldsymbol{\alpha},w\boldsymbol{\Sigma}\right)h\left( w\mid\boldsymbol{\theta}\right)dw}, $$
(1)

where ϕp(x ∣ μ + wα, wΣ) is the density of a p-dimensional Gaussian distribution with mean μ + wα and covariance matrix wΣ, and h(w ∣ θ) is the density of a univariate random variable W > 0 that has the role of a weight function (see Barndorff-Nielsen et al. 1982; Gneiting 1997). This weight function can take on many forms, some of which lead to density representations for well-known non-Gaussian distributions, e.g., if h(w ∣ θ) is the density of an inverse-gamma random variable with parameters (ν/2, ν/2), then (1) is a representation of the skew-t distribution with ν degrees of freedom (see Demarta and McNeil 2005; Murray et al. 2014). Further details on, and examples of, normal variance-mean mixtures are given by Barndorff-Nielsen (1978), Kotz et al. (2001), and Kotz and Nadarajah (2004), among others.
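To make the representation in Eq. 1 concrete, the following R sketch simulates from a normal variance-mean mixture with an inverse-gamma(ν/2, ν/2) weight, i.e., the skew-t representation just mentioned; the function name and parameter values are illustrative only, and the MASS package is assumed for mvrnorm().

```r
# Minimal sketch: simulate from the normal variance-mean mixture in Eq. 1
# with an inverse-gamma(nu/2, nu/2) weight.  Requires MASS for mvrnorm();
# all names and parameter values are illustrative.
library(MASS)

rnvmm_invgamma <- function(n, mu, alpha, Sigma, nu) {
  p <- length(mu)
  w <- 1 / rgamma(n, shape = nu / 2, rate = nu / 2)   # W ~ inverse-gamma(nu/2, nu/2)
  x <- matrix(NA_real_, n, p)
  for (i in 1:n) {
    # X | W = w ~ N(mu + w * alpha, w * Sigma)
    x[i, ] <- mvrnorm(1, mu = mu + w[i] * alpha, Sigma = w[i] * Sigma)
  }
  x
}

set.seed(1)
x <- rnvmm_invgamma(500, mu = c(0, 0), alpha = c(2, 0),
                    Sigma = diag(2), nu = 5)
```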

Now, the density of W > 0 from an inverse-gamma distribution with parameters (α, β) is given by

$$ h\left( w\mid\alpha,\beta\right)=w^{-\alpha-1}\frac{\beta^{\alpha}}{{\Gamma}\left( \alpha\right)}\exp\left\{-\frac{\beta}{ w}\right\}, $$
(2)

where Γ(⋅) is the Gamma function. Setting α = β = ν/2 in (2) gives

$$ h\left( w\mid\nu/2,\nu/2\right)=w^{-\nu/2-1}\frac{(\nu/2)^{\nu/2}}{{\Gamma}\left( \nu/2\right)}\exp\left\{-\frac{\nu}{2w}\right\}. $$
(3)

Setting α = 0 in (1), and using (3) for h(w ∣ θ), it follows that the density of the multivariate t-distribution with ν degrees of freedom can be written

$$ \begin{array}{@{}rcl@{}} f_{t}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma},\nu) &=&{\int}_{0}^{\infty}{\phi_{p}\left( \mathbf{x}\mid\boldsymbol{\mu},w \boldsymbol{\Sigma}\right)h\left( w\mid\nu/2,\nu/2\right)dw}\\ &=&\frac{{\Gamma} \left( [{\nu+p}]/{2} \right) |\boldsymbol{\Sigma}|^{-{1}/{2}}}{(\pi \nu)^{{p}/{2}} {\Gamma} \left( {\nu}/{2} \right) \left[1 + {\delta(\mathbf{x}, \boldsymbol{\mu} | \boldsymbol{\Sigma})}/{\nu} \right]^{{(\nu + p)}/{2}}} \end{array} $$
(4)

where δ(x, μ ∣ Σ) is the squared Mahalanobis distance between x and μ. Forbes and Wraith (2014) show that a multi-dimensional weight variable

$$\boldsymbol{\Delta}_{\mathbf{W}} = \text{diag}\left( w_{1}^{-1},\dots,w_{p}^{-1}\right)$$

can be incorporated into Eq. 1 via an eigen-decomposition of the symmetric positive-definite matrix Σ. Specifically, they set Σ = ΓΦΓ′, where Γ is a p × p matrix of eigenvectors and Φ is a p × p diagonal matrix containing the eigenvalues of Σ. It follows that the density of X becomes

$$ \begin{array}{@{}rcl@{}} &&\!\!\!\!\!f\left( \mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma}, \boldsymbol{\Phi},\boldsymbol{\alpha},\boldsymbol{\theta}\right)=\\ &&{\int}_{0}^{\infty}\cdots{\int}_{0}^{\infty}\phi_{p}\left( \mathbf{x}\mid\boldsymbol{\mu}+\boldsymbol{\Delta}_{\mathbf{W}}\boldsymbol{\alpha}, \boldsymbol{\Gamma}\boldsymbol{\Delta}_{\mathbf{W}}\boldsymbol{\Phi}\boldsymbol{\Gamma}^{\prime} \right) h_{\mathbf{W}}\left( w_{1},\dots,w_{p}\mid\boldsymbol{\theta}\right)dw_{1}{\dots} dw_{p}, \end{array} $$
(5)

where

$$h_{\mathbf{W}}\left( w_{1},\dots,w_{p}\mid\boldsymbol{\theta}\right) = h\left( w_{1}\mid\boldsymbol{\theta}_{1}\right)\times\dots\times h\left( w_{p}\mid\boldsymbol{\theta}_{p}\right)$$

is a p-dimensional density such that the random variables W1, … , Wp are independent, i.e., the weights are independent. The density given in Eq. 5 adds flexibility to normal variance-mean mixtures because the parameters θ1, … , θp are free to vary in each dimension. Using the density in Eq. 5, Forbes and Wraith (2014) derive the density of a multiple scaled multivariate-t distribution, Wraith and Forbes (2015) derive a multiple scaled normal-inverse Gaussian distribution, and Franczak et al. (2015) develop a multiple scaled shifted asymmetric Laplace distribution.

Setting Σ = ΓΦΓ′, it follows from Eq. 5 that the density of a multiple scaled analogue of Eq. 4 can be written

$$ f_{t\text{\tiny{MS}}}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma},\boldsymbol{\Phi},\boldsymbol{\nu}) = {\int}_{0}^{\infty}\cdots{\int}_{0}^{\infty}\phi_{p}\left( \mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma}\boldsymbol{\Phi}\boldsymbol{\Delta}_{\mathbf{W}}\boldsymbol{\Gamma}^{\prime}\right)h_{\mathbf{W}}\left( w_{1},\dots,w_{p}\mid\boldsymbol{\nu}\right)dw_{1}{\dots} dw_{p}, $$
(6)

where \(\boldsymbol {\Delta }_{\mathbf {W}} = \text {diag}\left (w_{1}^{-1},\dots ,w_{p}^{-1}\right )\) and the weight function

$$h_{\mathbf{W}}\left( w_{1},\dots,w_{p}\mid\boldsymbol{\nu}\right) = h\left( w_{1}\mid\nu_{1}/2,\nu_{1}/2\right)\times\dots\times h\left( w_{p}\mid\nu_{p}/2,\nu_{p}/2\right)$$

is a p-dimensional density, formed as a product of independent inverse-gamma densities, where h(wj ∣ νj/2, νj/2) is given by Eq. 3. Note that the scaled Gaussian density in Eq. 6 can be written

$$ \begin{array}{@{}rcl@{}} \phi_{p}\left( \mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma}\boldsymbol{\Phi}\boldsymbol{\Delta}_{\mathbf{W}}\boldsymbol{\Gamma}^{\prime}\right) &=& \prod\limits_{j = 1}^{p}{\phi_{1}\left( [\boldsymbol{\Gamma}^{\prime}\mathbf{x}]_{j} \mid [\boldsymbol{\Gamma}^{\prime}\boldsymbol{\mu}]_{j}, {\Phi}_{j}w_{j}^{-1}\right)} \\&=& \prod\limits_{j = 1}^{p}{\phi_{1}\left( [\boldsymbol{\Gamma}^{\prime}(\mathbf{x}-\boldsymbol{\mu})]_{j}\mid0,{\Phi}_{j}w_{j}^{-1} \right)} , \end{array} $$
(7)

where \(\phi _{1}\left ([\boldsymbol {\Gamma }^{\prime }(\mathbf {x}-\boldsymbol {\mu })]_{j}\mid 0,{\Phi }_{j}w_{j}^{-1} \right ) \) is the density of a univariate Gaussian distribution with mean 0 and variance \({\Phi }_{j}w_{j}^{-1}\), [Γ′(x − μ)]j is the j th element of Γ′(x − μ), and Φj is the j th eigenvalue of Φ, i.e., the j th diagonal element of the matrix Φ. It follows that Eq. 6 can be written

$$ f_{t\text{\tiny{MS}}}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma},\boldsymbol{\Phi},\boldsymbol{\nu}) = \prod\limits_{j = 1}^{p}{\int}_{0}^{\infty}{\phi_{1}\left( [\boldsymbol{\Gamma}^{\prime}(\mathbf{x}-\boldsymbol{\mu})]_{j}\mid 0, {\Phi}_{j}w_{j}^{-1} \right)h\left( w_{j}\mid\nu_{j}/2,\nu_{j}/2\right)dw_{j}}. $$
(8)

Solving the integral in Eq. 8 gives the density of a multiple scaled multivariate-t distribution,

$$ f_{t\text{\tiny{MS}}}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma},\boldsymbol{\Phi},\boldsymbol{\nu})=\prod\limits_{j = 1}^{p}{\frac{{\Gamma}([\nu_{j}+ 1]/2)}{{\Gamma}(\nu_{j}/2)({\Phi}_{j}\nu_{j}\pi)^{1/2}} \left[1+\frac{[\boldsymbol{\Gamma}^{\prime}(\mathbf{x}-\boldsymbol{\mu})]_{j}^{2}}{{\Phi}_{j}\nu_{j}}\right]^{-(\nu_{j}+ 1)/2}}, $$
(9)

where Φj is the j th eigenvalue of Φ, Γ is a matrix of eigenvectors, μ is a location parameter, and \([\boldsymbol {\Gamma }^{\prime }(\mathbf {x}-\boldsymbol {\mu })]_{j}^{2}/{\Phi }_{j}\) can be regarded as the contribution of the j th (rotated) dimension to the squared Mahalanobis distance between x and μ.

The main difference between the traditional multivariate-t density given in Eq. 4 and the multiple scaled multivariate-t density given in Eq. 9 is that the degrees of freedom can now be parameterized separately in each dimension j. Therefore, unlike the standard multivariate-t distribution, the multiple scaled density in Eq. 9 can account for different tail weight in each dimension (Forbes and Wraith 2014).
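For illustration, the density in Eq. 9 can be evaluated directly as a product over the rotated coordinates [Γ′(x − μ)]j, with a separate degrees-of-freedom parameter νj in each dimension; the following R sketch uses assumed function and argument names and is illustrative only.

```r
# Minimal sketch of the multiple scaled multivariate-t density in Eq. 9;
# nu is a vector of per-dimension degrees of freedom, Gamma is the p x p
# eigenvector matrix, and Phi is the vector of eigenvalues.
dmst <- function(x, mu, Gamma, Phi, nu, log = FALSE) {
  y <- as.vector(t(Gamma) %*% (x - mu))            # rotated coordinates [Gamma'(x - mu)]_j
  logdens <- sum(lgamma((nu + 1) / 2) - lgamma(nu / 2) -
                 0.5 * log(Phi * nu * pi) -
                 (nu + 1) / 2 * log(1 + y^2 / (Phi * nu)))
  if (log) logdens else exp(logdens)
}

# Toy example with different tail weight in each dimension (assumed values)
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
eig   <- eigen(Sigma)
dmst(c(0.3, -0.2), mu = c(0, 0), Gamma = eig$vectors,
     Phi = eig$values, nu = c(3, 30))
```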

3 Mixture of Multiple Scaled Generalized Hyperbolic Distributions

There are different ways to formulate the density of a generalized hyperbolic distribution (GHD; see McNeil et al. 2005, for example). Browne and McNicholas (2015) use a mixture of GHDs (MGHDs) for clustering and they use the following formulation for the density of a p-dimensional random vector X from a GHD:

$$ f_{\text{GH}}(\mathbf{x}\mid\boldsymbol{\theta}) =\left[ \frac{ \omega + \delta\left( \mathbf{x}, \boldsymbol{\mu}| \mathbf{\Sigma}\right)} { \omega+ \boldsymbol{\alpha}^{\prime}\boldsymbol{\Sigma}^{-1}\boldsymbol{\alpha}} \right]^{(\lambda-{p}/{2})/2}\frac{ K_{\lambda - {p}/{2}}\left( \sqrt{\left[ \omega+ \boldsymbol{\alpha}^{\prime}\boldsymbol{\Sigma}^{-1}\boldsymbol{\alpha}\right]\left[\omega +\delta\left( \mathbf{x}, \boldsymbol{\mu}| \boldsymbol{\Sigma}\right)\right]}\right)}{ \left( 2\pi\right)^{{p}/{2}} \left| \boldsymbol{\Sigma} \right|^{{1}/{2}} K_{\lambda}\left( \omega\right)\exp\left\{-\left( \mathbf{x}-\boldsymbol{\mu}\right)'\boldsymbol{\Sigma}^{-1}\boldsymbol{\alpha}\right\}}, $$
(10)

where \(\boldsymbol {\mu }\in \mathbb {R}^{p}\) is the location parameter, \(\boldsymbol {\alpha }\in \mathbb {R}^{p}\) is the skewness parameter, \(\boldsymbol {\Sigma }\in \mathbb {R}^{p\times p}\) is the scale matrix, \(\lambda \in \mathbb {R}\) is the index parameter, \(\omega \in \mathbb {R}^{+}\) is the concentration parameter, and Kλ is the modified Bessel function of the third kind with index λ. Now, X ∣ w ∼ N(μ + wα, wΣ) and W ∣ x ∼ GIG(ω + α′Σ−1α, ω + δ(x, μ ∣ Σ), λ − p/2), where W ∼ GIG(a, b, λ) denotes that W follows a generalized inverse Gaussian (GIG) distribution with density formulated as

$$ q(w | a,b,\lambda) = \frac{(a/b)^{\lambda/2}w^{\lambda-1}}{2K_{\lambda}(\sqrt{ab})} \exp\left\{-\frac{aw+b/w}{2}\right\}, $$
(11)

for w > 0, where \(a,b\in \mathbb {R}^{+}\), and \(\lambda \in \mathbb {R}\). The GIG distribution has some attractive properties including the tractability of the following expected values:

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}\left[ W \right] = \sqrt{\frac{b}{a}}\frac{K_{\lambda+ 1}\left( \sqrt{ab}\right)}{K_{\lambda}(\sqrt{ab})}, \qquad\qquad\mathbb{E}\left[{1}/{W}\right] = \sqrt{\frac{a}{b}}\frac{K_{\lambda+ 1}\left( \sqrt{ab}\right)}{K_{\lambda}\left( \sqrt{ab}\right)} -\frac{2\lambda}{b}, \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}[\log W] =\log\left( \sqrt{\frac{b}{a}}\right)+\frac{1}{K_{\lambda}\left( \sqrt{ab}\right)}\frac{\partial}{\partial\lambda}K_{\lambda}\left( \sqrt{ab}\right). \end{array} $$
(13)

Write X ∼GHD(μ, Σ, α, ω, λ) to denote that the p-dimensional random variable X has the density in Eq. 10.
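For illustration, the expectations in Eqs. 12 and 13 are straightforward to compute numerically with base R's besselK(), using a central finite difference for the derivative of Kλ with respect to λ; the sketch below uses assumed function names and parameter values.

```r
# Minimal sketch of the GIG expectations in Eqs. 12 and 13.
# besselK(x, nu) is base R; the derivative of K_lambda with respect to
# lambda is approximated by a central finite difference.
gig_moments <- function(a, b, lambda, eps = 1e-5) {
  s     <- sqrt(a * b)
  EW    <- sqrt(b / a) * besselK(s, lambda + 1) / besselK(s, lambda)
  EWinv <- sqrt(a / b) * besselK(s, lambda + 1) / besselK(s, lambda) - 2 * lambda / b
  dK    <- (besselK(s, lambda + eps) - besselK(s, lambda - eps)) / (2 * eps)
  ElogW <- log(sqrt(b / a)) + dK / besselK(s, lambda)
  list(EW = EW, EWinv = EWinv, ElogW = ElogW)
}

gig_moments(a = 2, b = 1.5, lambda = 0.5)   # illustrative values
```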

Now, note that the formulation of the GHD in Eq. 10 can be written as a normal variance-mean mixture where the univariate density is GIG, i.e.,

$$ \mathbf{X} = \boldsymbol{\mu} + W\boldsymbol{\alpha}+ \sqrt{W} \mathbf{V}, $$
(14)

where V ∼N(0, Σ) and W has density

$$ h(w\mid\omega,1,\lambda)=\frac{w^{\lambda-1}}{2 K_{\lambda}(\omega)}\exp{\left\{-\frac{\omega}{2}\left( w + \frac{1}{w}\right)\right\}}, $$
(15)

for w > 0, where ω and λ are as previously defined. Note that Eq. 15 is just an alternative parameterization of the GIG distribution. From Eqs. 14 and 15, it follows that the generalized hyperbolic density can be written

$$ f(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\alpha},\omega,\lambda)= {\int}_{0}^{\infty}{\phi_{p}\left( \mathbf{x}\mid\boldsymbol{\mu}+w\boldsymbol{\alpha},{w}\boldsymbol{\Sigma}\right)h(w\mid\omega,1,\lambda)dw}. $$
(16)
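The representation in Eqs. 14–16 also gives a direct route to simulation: draw W from the GIG density in Eq. 15 and then X given W = w from a Gaussian. The sketch below assumes the GIGrvg package, whose (λ, χ, ψ) parameterization corresponds to χ = ψ = ω here, and the MASS package; function names and parameter values are illustrative only.

```r
# Minimal sketch: simulate from the GHD via Eq. 14.
# GIGrvg::rgig(n, lambda, chi, psi) samples a density proportional to
# w^(lambda - 1) exp{-(chi/w + psi*w)/2}; chi = psi = omega gives Eq. 15.
library(MASS)
library(GIGrvg)

rghd <- function(n, mu, Sigma, alpha, omega, lambda) {
  p <- length(mu)
  w <- rgig(n, lambda = lambda, chi = omega, psi = omega)  # W from Eq. 15
  x <- matrix(NA_real_, n, p)
  for (i in 1:n) {
    x[i, ] <- mu + w[i] * alpha + sqrt(w[i]) * mvrnorm(1, rep(0, p), Sigma)
  }
  x
}

set.seed(2)
x <- rghd(500, mu = c(0, 0), Sigma = diag(2),
          alpha = c(1, -0.5), omega = 1, lambda = 0.5)
```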

We can use Eqs. 5 and 16 and an alternative parameterization to write the density of a multiple scaled generalized hyperbolic distribution (MSGHD) as

$$ \begin{array}{@{}rcl@{}} &&{}f_{\text{\tiny{MSGHD}}}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma},\boldsymbol{\Phi},\boldsymbol{\alpha},\boldsymbol{\omega},\boldsymbol{\lambda})=\\ &&{\int}_{0}^{\infty}\dots{\int}_{0}^{\infty}\phi_{p}\left( \boldsymbol{\Gamma}^{\prime} \mathbf{x} - \boldsymbol{\mu} - \boldsymbol{\Delta}_{\mathbf{w}} \boldsymbol{\alpha} \mid \mathbf{0}, \boldsymbol{\Delta}_{\mathbf{w}} \boldsymbol{\Phi} \right) h_{\mathbf{w}}(w_{1},\dots,w_{p}\mid\boldsymbol{\omega},\mathbf{1},\boldsymbol{\lambda})dw_{1}{\dots} dw_{p}, \end{array} $$
(17)

where ω = (ω1, … , ωp)′, λ = (λ1, … , λp)′, 1 is a p-vector of 1s, and

$$h_{\mathbf{W}}(w_{1},\dots,w_{p}\mid\boldsymbol{\omega},\mathbf{1},\boldsymbol{\lambda}) = h(w_{1}\mid\omega_{1},1,\lambda_{1})\times\dots\times h(w_{p}\mid\omega_{p},1,\lambda_{p}). $$

From Eqs. 6 and 7, it follows that Eq. 17 can be written

$$ \begin{array}{@{}rcl@{}} f_{\text{\tiny{MSGHD}}}\left( \mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma},\boldsymbol{\Phi},\boldsymbol{\alpha},\boldsymbol{\omega},\boldsymbol{\lambda}\right) &=&\prod\limits_{j = 1}^{p}{\int}_{0}^{\infty}\phi_{1}\left( \left[\boldsymbol{\Gamma}^{\prime} \mathbf{x}-\boldsymbol{\mu}-\boldsymbol{\Delta}_{\mathbf{w}} \boldsymbol{\alpha} \right]_{j}\mid 0,{\Phi}_{j}w_{j} \right) h\left( w_{j}\mid\omega_{j},1,\lambda_{j}\right)dw_{j}\\ &=&\prod\limits_{j = 1}^{p}\left\{\left[\frac{\omega_{j}+{\Phi}_{j}^{-1}\left( \left[\boldsymbol{\Gamma}^{\prime}\mathbf{x}\right]_{j}-\mu_{j}\right)^{2}}{\omega_{j}+{\alpha_{j}^{2}}{\Phi}_{j}^{-1}}\right]^{(\lambda_{j}-1/2)/2}\frac{K_{\lambda_{j}-1/2}\left( \sqrt{\left[\omega_{j}+{\alpha_{j}^{2}}{\Phi}_{j}^{-1}\right]\left[\omega_{j}+{\Phi}_{j}^{-1}\left( \left[\boldsymbol{\Gamma}^{\prime}\mathbf{x}\right]_{j}-\mu_{j}\right)^{2}\right]}\right)}{(2\pi)^{1/2}{\Phi}_{j}^{1/2}K_{\lambda_{j}}(\omega_{j})\exp\left\{-\left( \left[\boldsymbol{\Gamma}^{\prime}\mathbf{x}\right]_{j}-\mu_{j}\right){\Phi}_{j}^{-1}\alpha_{j}\right\}}\right\}, \end{array} $$

where [Γ′x]j is the j th element of the vector Γ′x, μj is the j th element of the location parameter μ, αj is the j th element of the skewness parameter α, Γ is a p × p matrix of eigenvectors, Φj is the j th eigenvalue of the diagonal matrix Φ, ω = (ω1, … , ωp)′ controls the concentration in each dimension, and λ = (λ1, … , λp)′ is a p-dimensional index parameter. Write X ∼ MSGHD(μ, Γ, Φ, α, ω, λ) to indicate that the random vector X follows an MSGHD with density fMSGHD(x ∣ μ, Γ, Φ, α, ω, λ). Then, a mixture of MSGHDs (MMSGHDs) has density

$$ f(\mathbf{x}\mid\boldsymbol{\boldsymbol{\vartheta}})= \sum\limits_{g = 1}^{G} \pi_{g} f_{\text{\tiny{MSGHD}}}\left( \mathbf{x}\mid\boldsymbol{\mu}_{g},\boldsymbol{\Gamma}_{g}, \boldsymbol{\Phi}_{g}, \boldsymbol{\alpha}_{g},\boldsymbol{\omega}_{g},\boldsymbol{\lambda}_{g}\right). $$
(18)

The identifiability of the MMSGHD is discussed in Appendix 3.
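For illustration, the closed-form MSGHD density above can be evaluated dimension by dimension with base R's besselK(); the sketch below follows the displayed expression term by term and uses assumed function and argument names (μ, α, Φ, ω, λ as p-vectors, Γ as a p × p matrix of eigenvectors).

```r
# Minimal sketch of the closed-form MSGHD density above, evaluated
# dimension by dimension in the rotated coordinates [Gamma' x]_j.
dmsghd <- function(x, mu, Gamma, Phi, alpha, omega, lambda, log = FALSE) {
  y <- as.vector(t(Gamma) %*% x) - mu        # [Gamma' x]_j - mu_j
  logdens <- 0
  for (j in seq_along(y)) {
    q1 <- omega[j] + y[j]^2 / Phi[j]         # omega_j + Phi_j^{-1}([Gamma' x]_j - mu_j)^2
    q2 <- omega[j] + alpha[j]^2 / Phi[j]     # omega_j + alpha_j^2 Phi_j^{-1}
    logdens <- logdens +
      (lambda[j] - 0.5) / 2 * (log(q1) - log(q2)) +
      log(besselK(sqrt(q1 * q2), lambda[j] - 0.5)) -
      0.5 * log(2 * pi * Phi[j]) - log(besselK(omega[j], lambda[j])) +
      y[j] * alpha[j] / Phi[j]
  }
  if (log) logdens else exp(logdens)
}

# Toy usage with assumed parameter values
eig <- eigen(matrix(c(1, 0.3, 0.3, 1), 2, 2))
dmsghd(c(0.5, 1), mu = c(0, 0), Gamma = eig$vectors, Phi = eig$values,
       alpha = c(0.5, -0.5), omega = c(1, 1), lambda = c(0.5, 0.5))
```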

4 Mixture of Coalesced Generalized Hyperbolic Distributions

Note that the generalized hyperbolic distribution is not a special or limiting case of the MSGHD under any parameterization with p > 1. Motivated by this, consider a coalesced generalized hyperbolic distribution (CGHD) that contains both the generalized hyperbolic distribution and MSGHD as limiting cases. The CGHD arises through the introduction of a random vector

$$ \mathbf{R}=U\mathbf{X}+(1-U)\mathbf{S}, $$
(19)

where X = ΓY, Y ∽GHD (μ, Σ, α, ω0, λ0), Σ = ΓΦΓ′, S ∽MSGHD(μ, Γ, Φ, α, ω, λ), and U is an indicator variable such that

$$U=\left\{\begin{array}{ll} 1 & \text{ if\ } \mathbf{R} \text{ follows a generalized hyperbolic distribution, and}\\ 0 & \text{ if\ } \mathbf{R} \text{ follows a MSGHD.} \end{array}\right.$$

It follows that \(\mathbf {X} = \boldsymbol {\Gamma }\boldsymbol {\mu } + W\boldsymbol {\Gamma }\boldsymbol {\alpha } + \sqrt {W}\boldsymbol {\Gamma }\mathbf {V}\), where ΓV ∼ Np(0, ΓΦΓ′), S = Γμ + ΓΔwα + ΓA, where ΓA ∼ Np(0, ΓΔwΦΓ′), and the density of R can be written

$$ \begin{array}{ll} f_{\text{\tiny{CGHD}}}(\mathbf{r}\mid\boldsymbol{\mu},&\boldsymbol{\Gamma}, \boldsymbol{\Phi},\boldsymbol{\alpha},\boldsymbol{\omega},\boldsymbol{\lambda},\omega_{0},\lambda_{0},\varpi)\\ &= \varpi f_{\text{\tiny{GHD}}}\left( \mathbf{r}\mid\boldsymbol{\mu},\boldsymbol{\Gamma} \boldsymbol{\Phi}\boldsymbol{\Gamma}^{\prime} ,\boldsymbol{\alpha},\omega_{0},\lambda_{0}\right)+ (1-\varpi)f_{\text{\tiny{MSGHD}}}\left( \mathbf{r}\mid\boldsymbol{\mu},\boldsymbol{\Gamma},\boldsymbol{\Phi},\boldsymbol{\alpha},\boldsymbol{\omega},\boldsymbol{\lambda}\right), \end{array} $$
(20)

where fGHD(⋅) is the density of a generalized hyperbolic random variable, fMSGHD(⋅) is the density of a MSGHD random variable, and ϖ ∈ (0, 1) is a mixing proportion. Note that the random vector R would be distributed generalized hyperbolic if ϖ = 1 and would be distributed MSGHD if ϖ = 0. However, we must restrict ϖ ∈ (0, 1) for identifiability reasons. Specifically, if ϖ = 0 then the value of fCGHD(rμ, Γ, Φ, α, ω, λ, ω0, λ0, ϖ) will be the same for any ω0 and λ0 whereas, if ϖ = 1, then the value of fCGHD(rμ, Γ, Φ, α, ω, λ, ω0, λ0, ϖ) will be the same for any ω and λ. The parameters μ, α, Γ, and Φ are the same for both densities, the parameters ω0 and λ0 are univariate values peculiar to the generalized hyperbolic distribution, and the p-dimensional parameters ω and λ are peculiar to the MSGHD. Write R ∽CGHD(μ, Γ, Φ, α, ω, λ, ω0, λ0, ϖ) to indicate that the random vector R follows a CGHD with density in Eq. 20.
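Because the CGHD density in Eq. 20 is simply a ϖ-weighted combination of the two densities, a density evaluator is immediate once evaluators for each part are available; the sketch below assumes hypothetical functions dghd() and dmsghd() (the latter as sketched in Section 3) and is illustrative only.

```r
# Minimal sketch of the CGHD density in Eq. 20, assuming dghd() and dmsghd()
# evaluate the GHD and MSGHD densities (hypothetical helpers, e.g., the
# sketches above).
dcghd <- function(x, mu, Gamma, Phi, alpha, omega, lambda,
                  omega0, lambda0, varpi) {
  Sigma <- Gamma %*% diag(Phi) %*% t(Gamma)        # scale matrix Gamma Phi Gamma'
  varpi * dghd(x, mu, Sigma, alpha, omega0, lambda0) +
    (1 - varpi) * dmsghd(x, mu, Gamma, Phi, alpha, omega, lambda)
}
```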

A mixture of CGHDs (MCGHDs) has density

$$ f(\mathbf{x}\mid\boldsymbol{\vartheta})= \sum\limits_{g = 1}^{G} \pi_{g}f_{\text{\tiny{CGHD}}}\left( \mathbf{x}\mid\boldsymbol{\mu}_{g},\boldsymbol{\Gamma}_{g},\boldsymbol{\Phi}_{g},\boldsymbol{\alpha}_{g},\boldsymbol{\omega}_{g},\boldsymbol{\lambda}_{g},\omega_{0g},\lambda_{0g},\varpi_{g}\right). $$

Parameter estimation can be carried out via a generalized expectation-maximization (GEM) algorithm (Dempster et al. 1977). There are four sources of missing data: the latent w0ig, the multi-dimensional weights Δwig = diag(w1ig, … , wpig), the component membership labels zig, and the inner component labels uig, for i = 1, … , n and g = 1, … , G. As usual, zig = 1 if observation i belongs to component g and zig = 0 otherwise. Similarly, uig = 1 if observation i, in component g, is distributed generalized hyperbolic and uig = 0 if observation i, in component g, is distributed MSGHD. It follows that the complete-data log-likelihood for the MCGHD is

$$ \begin{array}{@{}rcl@{}} l_{\text{c}}& =& \sum\limits_{i = 1}^{n} \sum\limits_{g = 1}^{G} \left\{\vphantom{\sum\limits_{j = 1}^{p}} z_{ig} \log \pi_{g} + z_{ig}{u_{ig}} \log \varpi_{g} + z_{ig}(1 - u_{ig}) \log (1-\varpi_{g}) \right.\\ & &+ z_{ig}{u_{ig}} \log h \left( w_{0ig} | \omega_{0g}, 1, \lambda_{0g} \right)+ z_{ig}(1 - u_{ig}) \sum\limits_{j = 1}^{p}\log h \left( w_{jig} | \omega_{jg}, 1, \lambda_{jg} \right) \\ &&+ z_{ig}{u_{ig}} \log \phi_{p} \left( \boldsymbol{\Gamma}_{g}^{\prime}\mathbf{x}_{i} | \boldsymbol{\mu}_{g} + w_{0ig} \boldsymbol{\alpha}_{g} , w_{0ig} \boldsymbol{\Phi}_{g} \right)\\ &&\left. + z_{ig}(1 - u_{ig}) \sum\limits_{j = 1}^{p} \log \phi_{1}\left( [\boldsymbol{\Gamma}_{g}^{\prime}\mathbf{x}_{i} ]_{j} | \mu_{jg} + w_{jig} \alpha_{jg} , w_{jig} {\Phi}_{jg} \right)\right\}. \end{array} $$

Further details on parameter estimation and on the identifiability of the MCGHD are given in Appendix 1 and Appendix 3, respectively.
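For orientation, the first part of the E-step can be sketched as follows: conditional on the current parameter values, the expected values of the labels zig and uig follow from Bayes' theorem. The R fragment below assumes hypothetical density evaluators dghd() and dmsghd() (e.g., the sketches above) and a list `pars` of current component parameters; it is illustrative only and is not the MixGHD implementation.

```r
# Minimal sketch of the E-step updates for the labels z_ig and u_ig.
estep_labels <- function(X, pi_g, pars) {
  n <- nrow(X); G <- length(pi_g)
  z <- matrix(NA_real_, n, G); u <- matrix(NA_real_, n, G)
  for (g in 1:G) {
    p_g <- pars[[g]]
    f_ghd <- apply(X, 1, function(x)
      dghd(x, p_g$mu, p_g$Gamma %*% diag(p_g$Phi) %*% t(p_g$Gamma),
           p_g$alpha, p_g$omega0, p_g$lambda0))
    f_ms  <- apply(X, 1, function(x)
      dmsghd(x, p_g$mu, p_g$Gamma, p_g$Phi, p_g$alpha, p_g$omega, p_g$lambda))
    f_cg  <- p_g$varpi * f_ghd + (1 - p_g$varpi) * f_ms
    z[, g] <- pi_g[g] * f_cg                  # unnormalized E[z_ig]
    u[, g] <- p_g$varpi * f_ghd / f_cg        # E[u_ig | z_ig = 1]
  }
  z <- z / rowSums(z)                         # normalize over the G components
  list(z = z, u = u)
}
```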

5 Cluster Convexity

The definition of a cluster has been discussed quite extensively in the literature. Hennig (2015) provides a very interesting discussion of some potential characteristics that clusters may have. Even though they cannot always be observed, and in some situations desired characteristics may even conflict, such characteristics provide important links and contrasts between the different definitions that are used for a cluster. In fact, the idea of desirable characteristics is not a new one, e.g., Cormack (1971) gives internal cohesion and external isolation as two “basic ideas” in this direction. In this paper, we follow the definition given by McNicholas (2016a): “a cluster is a unimodal component within an appropriate finite mixture model.” The term appropriate means that the model has the flexibility to fit the data and, in many cases, this also means that each cluster is convex (see McNicholas 2016a, Section 9.1). The development of flexible models outlined herein may make it a little easier to find an “appropriate” component density.

The MSGHD is more flexible than the GHD; however, similar to the multiple scaled multivariate t-distribution of Forbes and Wraith (2014), the MSGHD can have contours that are not convex. Accordingly, the MMSGHD can have components that are non-convex, leading to non-convex clusters. Consider the data in Fig. 1. How many clusters are there? The most plausible answer is two overlapping clusters: one with positive correlation between the variables and another with negative correlation between the variables. There may also be an argument for four or five clusters. Note that the data are generated from a G = 2 component mixture of multivariate t-distributions.

Fig. 1

Scatter plot of data generated from a two-component t-mixture (top-left) along with contours from fitted G = 1 component MMSGHD (top-right), G = 2 component MMSGHD (bottom-left), and G = 2 component McMSGHD (bottom-right) models, where plotting symbol and color represent predicted classifications

The MMSGHD is fitted to these data for G = 1, … , 5 and, as is common in model-based clustering applications, the Bayesian information criterion (BIC; Schwarz 1978) is used to select G. Note that the BIC is given by

$$\text{BIC}= 2l(\hat{\boldsymbol{\vartheta}})-\rho\log n,$$

where \(l(\hat {\boldsymbol {\vartheta }})\) is the maximized log-likelihood, ρ is the number of free parameters, and n is the number of observations. For the MMSGHD, the BIC selects a G = 1 component model (Fig. 1). Furthermore, looking at the G = 2 component MMSGHD solution (Fig. 1) confirms that the problem is not just one of model selection; the G = 2 component MMSGHD solution has one component that is roughly elliptical and another that is not convex. On the other hand, forcing the MSGHD to be convex, which can be done by imposing the constraint λj > 1, for j = 1, … , p, leads to what we call the convex MSGHD (cMSGHD). Fitting the corresponding mixture of cMSGHDs (McMSGHDs) ensures that, if each component is associated with a cluster, then the clusters are convex; for these data, a G = 2 component McMSGHD model is selected (Fig. 1). Formally, the constraint amounts to ensuring that the MSGHD density is quasi-concave (see Appendix 2). The general point here is that, if convexity is not enforced, then the MMSGHD can give components that contain multiple clusters. While it is easy to spot this in two dimensions, e.g., Fig. 1, the phenomenon may go unrecognized in higher dimensions, possibly resulting in greatly misleading results. Of course, the issue of non-convex clusters does not arise with most model-based approaches; however, when multiple scaled mixtures are considered, the issue can crop up. Another example in a similar vein is given in Fig. 2, where data are generated from a G = 3 component mixture of multivariate t-distributions. The selected MMSGHD has G = 2 components, including one clearly non-convex cluster, while the McMSGHD gives sensible clustering results.
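As a minimal sketch of the model selection step described above, the fragment below loops over G, fits the MMSGHD, and computes the BIC as defined above; fit_mmsghd() is a hypothetical wrapper (e.g., around the MSGHD() function in MixGHD) assumed to return the maximized log-likelihood and the number of free parameters.

```r
# Minimal sketch of BIC-based selection of G; fit_mmsghd() is a hypothetical
# fitting wrapper returning `loglik` (maximized log-likelihood) and `npar`
# (number of free parameters).
select_G <- function(X, G_max = 5) {
  bic <- rep(NA_real_, G_max)
  for (G in 1:G_max) {
    fit <- fit_mmsghd(X, G = G)                       # hypothetical fitting call
    bic[G] <- 2 * fit$loglik - fit$npar * log(nrow(X))
  }
  which.max(bic)                                      # BIC as defined above: larger is better
}
```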

Fig. 2

Scatter plots for model-based clustering results on data simulated from a three-component t-mixture (top-left), with contours from the selected MMSGHD (top-right) and McMSGHD (bottom) models, respectively, where plotting symbol and color represent predicted classifications

The intention behind the introduction of the McMSGHD is not that it should supplant the MMSGHD, but rather that it provides a convenient check on the number of components. In a higher dimensional application, where visualization is difficult or impossible, situations where the selected MMSGHD has fewer components than the selected McMSGHD will deserve special attention. Of course, this is not to say that the selected McMSGHD will always have more components in situations where the MMSGHD has too few, but rather that it will help to avoid the sort of situations depicted in Figs. 1 and 2. Note that parameter estimation for the McMSGHD is analogous to that for the MMSGHD, except that λkj (k = 0, 1, … , G, j = 1, … , p) is updated only if the value of its update exceeds 1.
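In code, this constraint can be sketched as a simple guard on the unconstrained update; the fragment below is illustrative only, with lambda_old and lambda_new denoting the current values and the unconstrained M-step updates for one component.

```r
# Minimal sketch of the McMSGHD constraint lambda > 1: accept the
# unconstrained M-step update only if it exceeds 1, otherwise keep the
# current value.  lambda_old and lambda_new are p-vectors for one component.
update_lambda_convex <- function(lambda_old, lambda_new) {
  ifelse(lambda_new > 1, lambda_new, lambda_old)
}
```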

6 Illustrations

6.1 Implementation and Evaluation

In the following illustrations, we fit the MGHDs, the MMSGHDs, the McMSGHDs, and the MCGHDs using the corresponding functions available in the MixGHD package (Tortora et al. 2017) for R (R Core Team 2017). We use k-means and k-medoids clustering to initialize the \(\hat {z}_{ig}\). The adjusted Rand index (ARI; Hubert and Arabie 1985) is used to compare predicted classifications with true classes. The ARI corrects the Rand index (Rand 1971) for chance; its expected value under random classification is 0, and it takes a value of 1 when there is perfect class agreement. Steinley (2004) gives guidelines for interpreting ARI values.
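As a minimal usage sketch, predicted and true labels can be compared as follows; here the adjustedRandIndex() function from the mclust package is used (the MixGHD package also provides an ARI function), and the label vectors are illustrative.

```r
# Minimal sketch: compare predicted classifications with the true classes
# using the ARI; `pred` and `truth` are assumed label vectors of equal length.
library(mclust)

truth <- c(1, 1, 1, 2, 2, 2)
pred  <- c(1, 1, 2, 2, 2, 2)
adjustedRandIndex(pred, truth)   # 1 = perfect agreement, ~0 under random labelling
```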

Several mixtures of skewed distributions have been proposed for model-based clustering and classification. Among them, three different formulations of the multivariate skew-t distribution have been used for clustering in the mixture setting. The formulation used by Murray et al. (2014) is a special case of the generalized hyperbolic distributions. Azzalini et al. (2016) refers to the other two formulations as the classical and SDB formulations (for the initials of the authors’ names), respectively. Lee and McLachlan (2013b, 2014) compare the mixture of classical skew-t distributions to the mixture of SDB skew-t distributions (MSDBST) and to skew-normal analogues of both formulations; their results lead one to believe that the MSDBST is generally preferable (see Azzalini et al. 2016, for an alternative viewpoint). The MSDBST approach is implemented using the EMMIXuskew package (Lee and McLachlan 2013a) for R and it is used for comparison herein.

6.2 Simulation Study

We use a simulation study to measure the performance of the proposed methods. The interest is in observing the ARI under different circumstances. The data are generated using two-component mixtures of Gaussian (Scenario 1), generalized hyperbolic (Scenario 2), and multiple scaled generalized hyperbolic distributions (Scenario 3), using the R function mvrnorm and the stochastic relationships given in Eq. 14 and Section 4. For each scenario, a three-factor full factorial design was used, where the factors are medium (M) or high (H) correlation, 25% or 50% overlap, and same (S), n1 = n2 = 100, or different (D), n1 = 150 and n2 = 50, numbers of observations per component. In all the scenarios, μ1 = 1. Table 1 shows the average ARI, with standard deviation, obtained on 10 data sets generated from a Gaussian mixture with p = 10, G = 2, correlation equal to 0.33 for M and 0.66 for H, and μ2 = 2 for 25% overlap and μ2 = 3 for 50% overlap. Table 2 shows the average ARI, with standard deviation, obtained on 10 data sets generated from an MGHD with p = 10, G = 2, α = 1, λ = 0.5, ω = 1, correlation equal to 0.25 for M and 0.5 for H, and μ2 = 10 for 25% overlap and μ2 = 15 for 50% overlap. Table 3 shows the average ARI, with standard deviation, obtained on 10 data sets generated from an MMSGHD with p = 10, G = 2, α1 = 1, α2 = -1, λ = 0.5, ω = 1, correlation equal to 0.25 for M and 0.35 for H, and μ2 = 19 for 25% overlap and μ2 = 23 for 50% overlap. For the data sets generated using the MGHD and the MMSGHD, the values of the correlation had to be reduced in order to maintain 25% and 50% overlap; higher correlation leads to higher overlap. Figure 3 shows scatter plots of the data simulated using the two-component MGDs, MGHDs, and MMSGHDs with high correlation, 50% overlap, and clusters of different sizes, where plotting symbol and color represent the true classifications.
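A minimal sketch of the data generation step for Scenario 1 is given below; an exchangeable correlation structure and the MASS function mvrnorm are assumed for illustration, with rho, mu2, n1, and n2 varying across the factorial design as described above.

```r
# Minimal sketch of Scenario 1: two Gaussian components in p = 10 dimensions
# with assumed exchangeable correlation rho, means mu1 and mu2, and sizes n1, n2.
library(MASS)

gen_scenario1 <- function(n1 = 100, n2 = 100, p = 10, rho = 0.33,
                          mu1 = 1, mu2 = 2) {
  Sigma <- matrix(rho, p, p); diag(Sigma) <- 1
  X <- rbind(mvrnorm(n1, rep(mu1, p), Sigma),
             mvrnorm(n2, rep(mu2, p), Sigma))
  list(X = X, class = rep(1:2, c(n1, n2)))
}

set.seed(3)
sim <- gen_scenario1()
```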

Fig. 3

Scatter plots for data simulated from two-component MGD (top-left), MGHD (top-right), and MMSGHD (bottom) models, respectively, with high correlation, 50% overlap, and clusters of different sizes, where plotting symbol and color represent the true classifications

Table 1 Average ARI values, with standard deviations, for each data set generated from MGDs, with different starting parameters
Table 2 Average ARI values, with standard deviations, for each data set generated from MGHDs, with different starting parameters and α = 1
Table 3 Average ARI, with standard deviations, for each data set generated from MMSGHDs, with different starting parameters and α = 1

The MCGHDs perform at least comparably to the best of the MMSGHDs and MGHDs. As expected, the MGHD outperforms the MMSGHD when the data are generated from an MGHD, and vice versa when the data are generated from an MMSGHD. The two methods perform similarly on normally distributed clusters. The generated clusters are all convex, which explains the similarity of the results obtained using the MMSGHDs and McMSGHDs. The level of overlap has a notable impact on the ARI, as expected. The performance of all the methods is slightly better when the clusters are of the same size and when the correlation is lower.

6.3 Real Data Analysis

To assess the classification performance of the MGHDs, the MMSGHDs, the McMSGHDs, the MCGHDs, and the MSDBST, we consider four real data sets that are commonly used within the model-based clustering literature (Table 4).

Table 4 Summary details for four real data sets commonly used within the model-based clustering literature

To compare the performance of the methods, we set the number of components, G, equal to the true number of classes. In general, the BIC is used to select the best starting criterion between k-means and k-medoids; however, because of the presence of outliers, only k-medoids starts are used for the bankruptcy data set. Table 5 displays the classification performance. Note that MSDBST is not used on the AIS data set because of the prohibitively high dimensionality (p = 11). The MCGHD generally performs comparably to the best of the other four approaches. Specifically, the MCGHD gives the best classification performance—either outright or jointly—for the four data sets. Interestingly, the MGHD gives very good classification performance on three of the four data sets; however, this must be taken in context with its very poor classification performance on the bankruptcy data set (ARI ≈ 0). It is also interesting to compare the classification performance of the McMSGHD to the MMSGHD as well as that of the MGHD to the MMSGHD. The McMSGHD and the MGHD approaches both outperform MMSGHD for one data set, and give a similar performance on two of the other three data sets. This highlights the fact that a mixture of multiple scaled distributions may well not outperform its single scaled analogue, and underlines the need for an approach with both the MSGHD and MGHD models as special cases. The results for the bankruptcy data illustrate that the MCGHD approach can give very good classification performance in situations where neither the MGHD nor the MMSGHD perform well. Finally, the MCGHD outperforms the MSDBST on two data sets and gives the same result on the third (recall that the MSDBST could not be fitted to the AIS data).

Table 5 ARI values for the MCGHD, MGHD, MMSGHD, McMSGHD, and MSDBST approaches on four real data sets

In this analysis, we took the number of components to be known. However, in a true clustering scenario, we would not have a priori information about the number of groups. Therefore, the analysis was repeated without this assumption and, for all but the bankruptcy data set, all approaches were run for G = 1, … , 5 components. Because of the small number of observations, the bankruptcy data were run for G = 1, 2, 3. Table 6 gives the number of components selected by the BIC.

Table 6 Selected number of components using the BIC for each real data set and corresponding ARI in brackets, where the correct number of components is highlighted in italics

For the bank note data, the BIC selects the correct number of components for every method; however, for the bankruptcy data, it picks the right number of components only for McMSGHDs. For the seed and AIS data sets, the BIC does not select the correct number of components for any approach. Recall that the MSDBST could not be fitted to the AIS data; also, it could not be fitted to the seed data for G = 4. These results illustrate that the BIC is not necessarily reliable for selecting the number of components in real data examples for any of the five approaches.

6.4 Computation Time

For the data sets listed in Table 4, we measured the elapsed user times, in seconds, to perform 100 (G)EM iterations. Note that all code was run in R version 3.0.2 on a 32-core Intel Xeon E5 server with 256GB RAM running 64-bit CentOS. Figure 4 displays the average elapsed time for two replications of each algorithm using a k-means and a k-medoids starting partition for G = 1, … , 5 components. The EM algorithm for the MSDBST is significantly slower than that of the hyperbolic-based approaches (Fig. 4). In fact, on the banknote data set, it takes the MSDBST more than 11 h to perform the required number of EM iterations when G = 2 and about 63 h when G = 5. For the seeds data set, it takes the MSDBST more than 5 h when G = 2 and more than 10 h when G = 5, whereas the G = 5 component MMSGHD, McMSGHD, and MCGHD approaches need less than 25 s. For the bankruptcy data set, the MSDBST requires approximately 80 s when G = 3, whereas the hyperbolic distributions need less than 20 s each.

Fig. 4

The average elapsed time to perform 100 iterations of the (G)EM algorithm when varying the number of components for MCGHD, MGHD, MMSGHD, McMSGHD, and MSDBST models on the bankruptcy, banknote, seeds, and AIS data sets, respectively

7 Discussion

Novel MCGHD, MMSGHD, and McMSGHD models have been introduced and applied for model-based clustering. The GHD is a flexible distribution, capable of handling skewness and heavy tails, and has many well-known distributions as special or limiting cases. Furthermore, it is a normal variance-mean mixture, arising via a relationship between a multivariate Gaussian and a univariate GIG distribution. The MSGHD extends the GHD by allowing a separate GIG weight in each dimension, increasing the flexibility of the model. However, the GHD is not a special case of the MSGHD; hence, we introduced the CGHD, which has both the GHD and the MSGHD as limiting cases. The McMSGHD approach was introduced as a convex version of the MMSGHD, and this point deserves some further discussion. The extension of the multivariate-t distribution to the multiple scaled setting was carried out by Forbes and Wraith (2014); as discussed in Appendix 2, the multiple scaled multivariate t-distribution cannot be quasi-concave, i.e., the clusters associated with a mixture of multiple scaled multivariate t-distributions cannot be convex. We have seen examples where the MMSGHD can put multiple clusters into one component, and the McMSGHD has an important role in helping to prevent this; if both approaches are fitted and lead to different numbers of components, then further attention is warranted.

Comparing the MGHD, MMSGHD, McMSGHD, and MCGHD approaches yielded some interesting results. Among them, we see that the MMSGHD does not necessarily outperform the MGHD; in fact, the MGHD approach gave better clustering performance than the MMSGHD approach on one of the four real data sets we considered, as well as similar performance on two of the remaining three. This underlines the fact that a mixture of multiple scaled distributions may well not outperform its single scaled analogue, and highlights the benefit of approaches with both a multiple scaled distribution and its single scaled analogue as limiting cases. The MCGHD represents one such approach, with the MSGHD and MGHD models as limiting cases. The approaches introduced herein, as well as the MGHD, have been made freely available via the MixGHD package for R.

Future work will proceed in several directions. For one, it will be interesting to study the performance of the approaches introduced herein within the fractionally supervised classification framework (Vrbik and McNicholas 2015; Gallaugher and McNicholas 2019a). The extension of these approaches to the matrix-variate paradigm is also of interest and may proceed in an analogous fashion to the work of Gallaugher and McNicholas (2018, 2019b). Finally, the models introduced herein can be extended to account for missing data (see Wei et al. 2019).