1 Introduction

Finite mixture models have been linked with clustering since the idea of defining a cluster in terms of a component in a finite mixture model was put forth more than 60 years ago (see McNicholas 2016a, Section 2.1). Nowadays, mixture model–based clustering is a popular approach to clustering. A random vector X arises from a finite mixture model if, for all x in its support, its density can be written

$$f(\mathbf{x}\mid\boldsymbol{\vartheta})= \sum\limits_{g = 1}^{G} \pi_{g} f_{g}(\mathbf{x}\mid\boldsymbol{\theta}_{g}),$$

where πg > 0 such that \({\sum }_{g = 1}^{G} \pi _{g} = 1\) are the mixing proportions, fg(x ∣ θg) is the g th component density, and 𝜗 = (π, θ1, … , θG) denotes the vector of parameters with π = (π1, … , πG). The component densities f1(x ∣ θ1), … , fG(x ∣ θG) are typically taken to be of the same type, most commonly multivariate Gaussian. In fact, until a few years after the turn of the century, almost all work on clustering and classification using mixture models had been based on Gaussian mixture models (e.g., Banfield and Raftery 1993; Celeux and Govaert 1995; Ghahramani and Hinton 1997; Tipping and Bishop 1999; McLachlan and Peel 2000; Fraley and Raftery 2002).

Early work on non-Gaussian mixtures was on mixtures of multivariate t-distributions (e.g., Peel and McLachlan 2000). A little beyond the turn of the century, work on t-mixtures burgeoned into a substantial subfield of mixture model-based classification (e.g., McLachlan et al. 2007; Andrews and McNicholas 2011a, b, 2012; Baek and McLachlan 2011; Steane et al. 2012; Lin et al. 2014; Pesevski et al. 2018). Around the same time, work on mixtures of skewed distributions took off, including work on skew-normal mixtures (e.g., Lin 2009), skew-t mixtures (e.g., Lin 2010; Vrbik and McNicholas 2012, 2014; Lee and McLachlan 2013a, b; Murray et al. 2014), Laplace mixtures (e.g., Franczak et al. 2014), variance-gamma mixtures (McNicholas et al. 2017), generalized hyperbolic mixtures (Browne and McNicholas 2015), and other non-elliptically contoured distributions (e.g., Karlis and Santourian 2009; Murray et al. 2017; Tang et al. 2018). A thorough review of work on model-based clustering is given by McNicholas (2016b).

More recently, mixtures of multiple scaled distributions have been considered (Section 2). In the present manuscript, a multiple scaled generalized hyperbolic distribution is introduced (Section 3). Because the (multivariate) generalized hyperbolic distribution is not a special case of the multiple scaled generalized hyperbolic distribution, a coalesced generalized hyperbolic distribution is also developed (Section 4). The issue of cluster convexity, which has essentially been ignored in work to date on multiple scaled distributions, is discussed, and a special case of the multiple scaled generalized hyperbolic distribution is developed to guarantee that the components, i.e., the clusters, are convex (Section 5).

2 Multiple Scaled Distributions

The distribution of a p-dimensional random variable X is said to be a normal variance-mean mixture if its density can be written in the form

$$ f(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\alpha},\boldsymbol{\theta})= {\int}_{0}^{\infty}{\phi_{p}\left( \mathbf{x}\mid\boldsymbol{\mu}+w\boldsymbol{\alpha},w\boldsymbol{\Sigma}\right)h\left( w\mid\boldsymbol{\theta}\right)dw}, $$
(1)

where ϕp(x ∣ μ + wα, wΣ) is the density of a p-dimensional Gaussian distribution with mean μ + wα and covariance matrix wΣ, and h(w ∣ θ) is the density of a univariate random variable W > 0 that has the role of a weight function (see Barndorff-Nielsen et al. 1982; Gneiting 1997). This weight function can take on many forms, some of which lead to density representations for well-known non-Gaussian distributions, e.g., if h(w ∣ θ) is the density of an inverse-gamma random variable with parameters (ν/2, ν/2), then (1) is a representation of the skew-t distribution with ν degrees of freedom (see Demarta and McNeil 2005; Murray et al. 2014). Further details on, and examples of, normal variance-mean mixtures are given by Barndorff-Nielsen (1978), Kotz et al. (2001), and Kotz and Nadarajah (2004), among others.
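To make the representation in Eq. 1 concrete, the following R sketch simulates from a normal variance-mean mixture with an inverse-gamma(ν/2, ν/2) weight, i.e., the skew-t representation just mentioned; the function name and parameter values are illustrative only, and the MASS package is assumed for mvrnorm().

```r
# Minimal sketch: simulate from the normal variance-mean mixture in Eq. 1
# with an inverse-gamma(nu/2, nu/2) weight.  Requires MASS for mvrnorm();
# all names and parameter values are illustrative.
library(MASS)

rnvmm_invgamma <- function(n, mu, alpha, Sigma, nu) {
  p <- length(mu)
  w <- 1 / rgamma(n, shape = nu / 2, rate = nu / 2)   # W ~ inverse-gamma(nu/2, nu/2)
  x <- matrix(NA_real_, n, p)
  for (i in 1:n) {
    # X | W = w ~ N(mu + w * alpha, w * Sigma)
    x[i, ] <- mvrnorm(1, mu = mu + w[i] * alpha, Sigma = w[i] * Sigma)
  }
  x
}

set.seed(1)
x <- rnvmm_invgamma(500, mu = c(0, 0), alpha = c(2, 0),
                    Sigma = diag(2), nu = 5)
```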

Now, the density of W > 0 from an inverse-gamma distribution with parameters (α, β) is given by

$$ h\left( w\mid\alpha,\beta\right)=w^{-\alpha-1}\frac{\beta^{\alpha}}{{\Gamma}\left( \alpha\right)}\exp\left\{-\frac{\beta}{ w}\right\}, $$
(2)

where Γ(⋅) is the Gamma function. Setting α = β = ν/2 in (2) gives

$$ h\left( w\mid\nu/2,\nu/2\right)=w^{-\nu/2-1}\frac{(\nu/2)^{\nu/2}}{{\Gamma}\left( \nu/2\right)}\exp\left\{-\frac{\nu}{2w}\right\}. $$
(3)

Setting α = 0 in (1), and using (3) for h(w ∣ θ), it follows that the density of the multivariate t-distribution with ν degrees of freedom can be written

$$ \begin{array}{@{}rcl@{}} f_{t}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma},\nu) &=&{\int}_{0}^{\infty}{\phi_{p}\left( \mathbf{x}\mid\boldsymbol{\mu},w \boldsymbol{\Sigma}\right)h\left( w\mid\nu/2,\nu/2\right)dw}\\ &=&\frac{{\Gamma} \left( [{\nu+p}]/{2} \right) |\boldsymbol{\Sigma}|^{-{1}/{2}}}{(\pi \nu)^{{p}/{2}} {\Gamma} \left( {\nu}/{2} \right) \left[1 + {\delta(\mathbf{x}, \boldsymbol{\mu} | \boldsymbol{\Sigma})}/{\nu} \right]^{{(\nu + p)}/{2}}} \end{array} $$
(4)

where δ(x, μ ∣ Σ) is the squared Mahalanobis distance between x and μ. Forbes and Wraith (2014) show that a multi-dimensional weight variable

$$\boldsymbol{\Delta}_{\mathbf{W}} = \text{diag}\left( w_{1}^{-1},\dots,w_{p}^{-1}\right)$$

can be incorporated into Eq. 1 via an eigen-decomposition of the symmetric positive-definite matrix Σ. Specifically, they set Σ = ΓΦΓ′, where Γ is a p × p matrix of eigenvectors and Φ is a p × p diagonal matrix containing the eigenvalues of Σ. It follows that the density of X becomes

$$ \begin{array}{@{}rcl@{}} &&\!\!\!\!\!f\left( \mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma}, \boldsymbol{\Phi},\boldsymbol{\alpha},\boldsymbol{\theta}\right)=\\ &&{\int}_{0}^{\infty}\cdots{\int}_{0}^{\infty}\phi_{p}\left( \mathbf{x}\mid\boldsymbol{\mu}+\boldsymbol{\Delta}_{\mathbf{W}}\boldsymbol{\alpha}, \boldsymbol{\Gamma}\boldsymbol{\Delta}_{\mathbf{W}}\boldsymbol{\Phi}\boldsymbol{\Gamma}^{\prime} \right) h_{\mathbf{W}}\left( w_{1},\dots,w_{p}\mid\boldsymbol{\theta}\right)dw_{1}{\dots} dw_{p}, \end{array} $$
(5)

where

$$h_{\mathbf{W}}\left( w_{1},\dots,w_{p}\mid\boldsymbol{\theta}\right) = h\left( w_{1}\mid\boldsymbol{\theta}_{1}\right)\times\dots\times h\left( w_{p}\mid\boldsymbol{\theta}_{p}\right)$$

is a p-dimensional density such that the random variables W1, … , Wp are independent, i.e., the weights are independent. The density given in Eq. 5 adds flexibility to normal variance-mean mixtures because the parameters θ1, … , θp are free to vary in each dimension. Using the density in Eq. 5, Forbes and Wraith (2014) derive the density of a multiple scaled multivariate-t distribution, Wraith and Forbes (2015) derive a multiple scaled normal-inverse Gaussian distribution, and Franczak et al. (2015) develop a multiple scaled shifted asymmetric Laplace distribution.

Setting Σ = ΓΦΓ′, it follows from Eq. 5 that the density of a multiple scaled analogue of Eq. 4 can be written

$$ f_{t\text{\tiny{MS}}}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma},\boldsymbol{\Phi},\boldsymbol{\nu}) = {\int}_{0}^{\infty}\cdots{\int}_{0}^{\infty}\phi_{p}\left( \mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma}\boldsymbol{\Phi}\boldsymbol{\Delta}_{\mathbf{W}}\boldsymbol{\Gamma}^{\prime}\right)h_{\mathbf{W}}\left( w_{1},\dots,w_{p}\mid\boldsymbol{\nu}\right)dw_{1}{\dots} dw_{p}, $$
(6)

where \(\boldsymbol {\Delta }_{\mathbf {W}} = \text {diag}\left (w_{1}^{-1},\dots ,w_{p}^{-1}\right )\) and the weight function

$$h_{\mathbf{W}}\left( w_{1},\dots,w_{p}\mid\boldsymbol{\nu}\right) = h\left( w_{1}\mid\nu_{1}/2,\nu_{1}/2\right)\times\dots\times h\left( w_{p}\mid\nu_{p}/2,\nu_{p}/2\right)$$

is a p-dimensional density, formed as a product of independent inverse-gamma densities, where h(wj ∣ νj/2, νj/2) is given by Eq. 3. Note that the scaled Gaussian density in Eq. 6 can be written

$$ \begin{array}{@{}rcl@{}} \phi_{p}\left( \mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma}\boldsymbol{\Phi}\boldsymbol{\Delta}_{\mathbf{W}}\boldsymbol{\Gamma}^{\prime}\right) &=& \prod\limits_{j = 1}^{p}{\phi_{1}\left( [\boldsymbol{\Gamma}^{\prime}\mathbf{x}]_{j} \mid [\boldsymbol{\Gamma}^{\prime}\boldsymbol{\mu}]_{j}, {\Phi}_{j}w_{j}^{-1}\right)} \\&=& \prod\limits_{j = 1}^{p}{\phi_{1}\left( [\boldsymbol{\Gamma}^{\prime}(\mathbf{x}-\boldsymbol{\mu})]_{j}\mid0,{\Phi}_{j}w_{j}^{-1} \right)} , \end{array} $$
(7)

where \(\phi _{1}\left ([\boldsymbol {\Gamma }^{\prime }(\mathbf {x}-\boldsymbol {\mu })]_{j}\mid 0,{\Phi }_{j}w_{j}^{-1} \right ) \) is the density of a univariate Gaussian distribution with mean 0 and variance \({\Phi }_{j}w_{j}^{-1}\), [Γ′(x − μ)]j is the j th element of Γ′(x − μ), and Φj is the j th eigenvalue of Φ, i.e., the j th diagonal element of the matrix Φ. It follows that Eq. 6 can be written

$$ f_{t\text{\tiny{MS}}}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma},\boldsymbol{\Phi},\boldsymbol{\nu}) = \prod\limits_{j = 1}^{p}{\int}_{0}^{\infty}{\phi_{1}\left( [\boldsymbol{\Gamma}^{\prime}(\mathbf{x}-\boldsymbol{\mu})]_{j}\mid 0, {\Phi}_{j}w_{j}^{-1} \right)h\left( w_{j}\mid\nu_{j}/2,\nu_{j}/2\right)dw_{j}}. $$
(8)

Solving the integral in Eq. 8 gives the density of a multiple scaled multivariate-t distribution,

$$ f_{t\text{\tiny{MS}}}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma},\boldsymbol{\Phi},\boldsymbol{\nu})=\prod\limits_{j = 1}^{p}{\frac{{\Gamma}([\nu_{j}+ 1]/2)}{{\Gamma}(\nu_{j}/2)({\Phi}_{j}\nu_{j}\pi)^{1/2}} \left[1+\frac{[\boldsymbol{\Gamma}^{\prime}(\mathbf{x}-\boldsymbol{\mu})]_{j}^{2}}{{\Phi}_{j}\nu_{j}}\right]^{-(\nu_{j}+ 1)/2}}, $$
(9)

where Φj is the j th eigenvalue of Φ, Γ is a matrix of eigenvectors, μ is a location parameter, and \([\boldsymbol {\Gamma }^{\prime }(\mathbf {x}-\boldsymbol {\mu })]_{j}^{2}/{\Phi }_{j}\) can be regarded as the contribution of the j th (rotated) dimension to the squared Mahalanobis distance between x and μ.

The main difference between the traditional multivariate-t density given in Eq. 4 and the multiple scaled multivariate-t density given in Eq. 9 is that the degrees of freedom can now be parameterized separately in each dimension j. Therefore, unlike the standard multivariate-t distribution, the multiple scaled density in Eq. 9 can account for different tail weight in each dimension (Forbes and Wraith 2014).
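For illustration, the density in Eq. 9 can be evaluated directly as a product over the rotated coordinates [Γ′(x − μ)]j, with a separate degrees-of-freedom parameter νj in each dimension; the following R sketch uses assumed function and argument names and is illustrative only.

```r
# Minimal sketch of the multiple scaled multivariate-t density in Eq. 9;
# nu is a vector of per-dimension degrees of freedom, Gamma is the p x p
# eigenvector matrix, and Phi is the vector of eigenvalues.
dmst <- function(x, mu, Gamma, Phi, nu, log = FALSE) {
  y <- as.vector(t(Gamma) %*% (x - mu))            # rotated coordinates [Gamma'(x - mu)]_j
  logdens <- sum(lgamma((nu + 1) / 2) - lgamma(nu / 2) -
                 0.5 * log(Phi * nu * pi) -
                 (nu + 1) / 2 * log(1 + y^2 / (Phi * nu)))
  if (log) logdens else exp(logdens)
}

# Toy example with different tail weight in each dimension (assumed values)
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
eig   <- eigen(Sigma)
dmst(c(0.3, -0.2), mu = c(0, 0), Gamma = eig$vectors,
     Phi = eig$values, nu = c(3, 30))
```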

3 Mixture of Multiple Scaled Generalized Hyperbolic Distributions

There are different ways to formulate the density of a generalized hyperbolic distribution (GHD; see McNeil et al. 2005, for example). Browne and McNicholas (2015) use a mixture of GHDs (MGHDs) for clustering and they use the following formulation for the density of a p-dimensional random vector X from a GHD:

$$ f_{\text{GH}}(\mathbf{x}\mid\boldsymbol{\theta}) =\left[ \frac{ \omega + \delta\left( \mathbf{x}, \boldsymbol{\mu}| \mathbf{\Sigma}\right)} { \omega+ \boldsymbol{\alpha}^{\prime}\boldsymbol{\Sigma}^{-1}\boldsymbol{\alpha}} \right]^{(\lambda-{p}/{2})/2}\frac{ K_{\lambda - {p}/{2}}\left( \sqrt{\left[ \omega+ \boldsymbol{\alpha}^{\prime}\boldsymbol{\Sigma}^{-1}\boldsymbol{\alpha}\right]\left[\omega +\delta\left( \mathbf{x}, \boldsymbol{\mu}| \boldsymbol{\Sigma}\right)\right]}\right)}{ \left( 2\pi\right)^{{p}/{2}} \left| \boldsymbol{\Sigma} \right|^{{1}/{2}} K_{\lambda}\left( \omega\right)\exp\left\{-\left( \mathbf{x}-\boldsymbol{\mu}\right)'\boldsymbol{\Sigma}^{-1}\boldsymbol{\alpha}\right\}}, $$
(10)

where \(\boldsymbol {\mu }\in \mathbb {R}^{p}\) is the location parameter, \(\boldsymbol {\alpha }\in \mathbb {R}^{p}\) is the skewness parameter, \(\boldsymbol {\Sigma }\in \mathbb {R}^{p\times p}\) is the scale matrix, \(\lambda \in \mathbb {R}\) is the index parameter, \(\omega \in \mathbb {R}^{+}\) is the concentration parameter, and Kλ is the modified Bessel function of the third kind with index λ. Now, X ∣ w ∼ N(μ + wα, wΣ) and W ∣ x ∼ GIG(ω + α′Σ−1α, ω + δ(x, μ ∣ Σ), λ − p/2), where W ∼ GIG(a, b, λ) denotes that W follows a generalized inverse Gaussian (GIG) distribution with density formulated as

$$ q(w | a,b,\lambda) = \frac{(a/b)^{\lambda/2}w^{\lambda-1}}{2K_{\lambda}(\sqrt{ab})} \exp\left\{-\frac{aw+b/w}{2}\right\}, $$
(11)

for w > 0, where \(a,b\in \mathbb {R}^{+}\), and \(\lambda \in \mathbb {R}\). The GIG distribution has some attractive properties including the tractability of the following expected values:

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}\left[ W \right] = \sqrt{\frac{b}{a}}\frac{K_{\lambda+ 1}\left( \sqrt{ab}\right)}{K_{\lambda}(\sqrt{ab})}, \qquad\qquad\mathbb{E}\left[{1}/{W}\right] = \sqrt{\frac{a}{b}}\frac{K_{\lambda+ 1}\left( \sqrt{ab}\right)}{K_{\lambda}\left( \sqrt{ab}\right)} -\frac{2\lambda}{b}, \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}[\log W] =\log\left( \sqrt{\frac{b}{a}}\right)+\frac{1}{K_{\lambda}\left( \sqrt{ab}\right)}\frac{\partial}{\partial\lambda}K_{\lambda}\left( \sqrt{ab}\right). \end{array} $$
(13)

Write X ∼GHD(μ, Σ, α, ω, λ) to denote that the p-dimensional random variable X has the density in Eq. 10.
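For illustration, the expectations in Eqs. 12 and 13 are straightforward to compute numerically with base R's besselK(), using a central finite difference for the derivative of Kλ with respect to λ; the sketch below uses assumed function names and parameter values.

```r
# Minimal sketch of the GIG expectations in Eqs. 12 and 13.
# besselK(x, nu) is base R; the derivative of K_lambda with respect to
# lambda is approximated by a central finite difference.
gig_moments <- function(a, b, lambda, eps = 1e-5) {
  s     <- sqrt(a * b)
  EW    <- sqrt(b / a) * besselK(s, lambda + 1) / besselK(s, lambda)
  EWinv <- sqrt(a / b) * besselK(s, lambda + 1) / besselK(s, lambda) - 2 * lambda / b
  dK    <- (besselK(s, lambda + eps) - besselK(s, lambda - eps)) / (2 * eps)
  ElogW <- log(sqrt(b / a)) + dK / besselK(s, lambda)
  list(EW = EW, EWinv = EWinv, ElogW = ElogW)
}

gig_moments(a = 2, b = 1.5, lambda = 0.5)   # illustrative values
```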

Now, note that the formulation of the GHD in Eq. 10 can be written as a normal variance-mean mixture where the univariate density is GIG, i.e.,

$$ \mathbf{X} = \boldsymbol{\mu} + W\boldsymbol{\alpha}+ \sqrt{W} \mathbf{V}, $$
(14)

where V ∼N(0, Σ) and W has density

$$ h(w\mid\omega,1,\lambda)=\frac{w^{\lambda-1}}{2 K_{\lambda}(\omega)}\exp{\left\{-\frac{\omega}{2}\left( w + \frac{1}{w}\right)\right\}}, $$
(15)

for w > 0, where ω and λ are as previously defined. Note that Eq. 15 is just an alternative parameterization of the GIG distribution. From Eqs. 14 and 15, it follows that the generalized hyperbolic density can be written

$$ f(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\alpha},\omega,\lambda)= {\int}_{0}^{\infty}{\phi_{p}\left( \mathbf{x}\mid\boldsymbol{\mu}+w\boldsymbol{\alpha},{w}\boldsymbol{\Sigma}\right)h(w\mid\omega,1,\lambda)dw}. $$
(16)
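The representation in Eqs. 14–16 also gives a direct route to simulation: draw W from the GIG density in Eq. 15 and then X given W = w from a Gaussian. The sketch below assumes the GIGrvg package, whose (λ, χ, ψ) parameterization corresponds to χ = ψ = ω here, and the MASS package; function names and parameter values are illustrative only.

```r
# Minimal sketch: simulate from the GHD via Eq. 14.
# GIGrvg::rgig(n, lambda, chi, psi) samples a density proportional to
# w^(lambda - 1) exp{-(chi/w + psi*w)/2}; chi = psi = omega gives Eq. 15.
library(MASS)
library(GIGrvg)

rghd <- function(n, mu, Sigma, alpha, omega, lambda) {
  p <- length(mu)
  w <- rgig(n, lambda = lambda, chi = omega, psi = omega)  # W from Eq. 15
  x <- matrix(NA_real_, n, p)
  for (i in 1:n) {
    x[i, ] <- mu + w[i] * alpha + sqrt(w[i]) * mvrnorm(1, rep(0, p), Sigma)
  }
  x
}

set.seed(2)
x <- rghd(500, mu = c(0, 0), Sigma = diag(2),
          alpha = c(1, -0.5), omega = 1, lambda = 0.5)
```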

We can use Eqs. 5 and 16 and an alternative parameterization to write the density of a multiple scaled generalized hyperbolic distribution (MSGHD) as

$$ \begin{array}{@{}rcl@{}} &&{}f_{\text{\tiny{MSGHD}}}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma},\boldsymbol{\Phi},\boldsymbol{\alpha},\boldsymbol{\omega},\boldsymbol{\lambda})=\\ &&{\int}_{0}^{\infty}\dots{\int}_{0}^{\infty}\phi_{p}\left( \boldsymbol{\Gamma}^{\prime} \mathbf{x} - \boldsymbol{\mu} - \boldsymbol{\Delta}_{\mathbf{w}} \boldsymbol{\alpha} \mid \mathbf{0}, \boldsymbol{\Delta}_{\mathbf{w}} \boldsymbol{\Phi} \right) h_{\mathbf{w}}(w_{1},\dots,w_{p}\mid\boldsymbol{\omega},\mathbf{1},\boldsymbol{\lambda})dw_{1}{\dots} dw_{p}, \end{array} $$
(17)

where ω = (ω1, … , ωp)′, λ = (λ1, … , λp)′, 1 is a p-vector of 1s, and

$$h_{\mathbf{W}}(w_{1},\dots,w_{p}\mid\boldsymbol{\omega},\mathbf{1},\boldsymbol{\lambda}) = h(w_{1}\mid\omega_{1},1,\lambda_{1})\times\dots\times h(w_{p}\mid\omega_{p},1,\lambda_{p}). $$

From Eqs. 6 and 7, it follows that Eq. 17 can be written

$$ \begin{array}{@{}rcl@{}} f_{\text{\tiny{MSGHD}}}\left( \mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Gamma},\boldsymbol{\Phi},\boldsymbol{\alpha},\boldsymbol{\omega},\boldsymbol{\lambda}\right) &=&\prod\limits_{j = 1}^{p}{\int}_{0}^{\infty}\phi_{1}\left( \left[\boldsymbol{\Gamma}^{\prime} \mathbf{x}-\boldsymbol{\mu}-\boldsymbol{\Delta}_{\mathbf{w}} \boldsymbol{\alpha} \right]_{j}\mid 0,{\Phi}_{j}w_{j} \right) h\left( w_{j}\mid\omega_{j},1,\lambda_{j}\right)dw_{j}\\ &=&\prod\limits_{j = 1}^{p}\left\{\left[\frac{\omega_{j}+{\Phi}_{j}^{-1}\left( \left[\boldsymbol{\Gamma}^{\prime}\mathbf{x}\right]_{j}-\mu_{j}\right)^{2}}{\omega_{j}+{\alpha_{j}^{2}}{\Phi}_{j}^{-1}}\right]^{(\lambda_{j}-1/2)/2}\frac{K_{\lambda_{j}-1/2}\left( \sqrt{\left[\omega_{j}+{\alpha_{j}^{2}}{\Phi}_{j}^{-1}\right]\left[\omega_{j}+{\Phi}_{j}^{-1}\left( \left[\boldsymbol{\Gamma}^{\prime}\mathbf{x}\right]_{j}-\mu_{j}\right)^{2}\right]}\right)}{(2\pi)^{1/2}{\Phi}_{j}^{1/2}K_{\lambda_{j}}(\omega_{j})\exp\left\{-\left( \left[\boldsymbol{\Gamma}^{\prime}\mathbf{x}\right]_{j}-\mu_{j}\right){\Phi}_{j}^{-1}\alpha_{j}\right\}}\right\}, \end{array} $$

where [Γ′x]j is the j th element of the vector Γ′x, μj is the j th element of the location parameter μ, αj is the j th element of the skewness parameter α, Γ is a p × p matrix of eigenvectors, Φj is the j th eigenvalue of the diagonal matrix Φ, ω = (ω1, … , ωp)′ controls the concentration in each dimension, and λ = (λ1, … , λp)′ is a p-dimensional index parameter. Write X ∼ MSGHD(μ, Γ, Φ, α, ω, λ) to indicate that the random vector X follows an MSGHD with density fMSGHD(x ∣ μ, Γ, Φ, α, ω, λ). Then, a mixture of MSGHDs (MMSGHDs) has density

$$ f(\mathbf{x}\mid\boldsymbol{\boldsymbol{\vartheta}})= \sum\limits_{g = 1}^{G} \pi_{g} f_{\text{\tiny{MSGHD}}}\left( \mathbf{x}\mid\boldsymbol{\mu}_{g},\boldsymbol{\Gamma}_{g}, \boldsymbol{\Phi}_{g}, \boldsymbol{\alpha}_{g},\boldsymbol{\omega}_{g},\boldsymbol{\lambda}_{g}\right). $$
(18)

The identifiability of the MMSGHD is discussed in Appendix 3.
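For illustration, the closed-form MSGHD density above can be evaluated dimension by dimension with base R's besselK(); the sketch below follows the displayed expression term by term and uses assumed function and argument names (μ, α, Φ, ω, λ as p-vectors, Γ as a p × p matrix of eigenvectors).

```r
# Minimal sketch of the closed-form MSGHD density above, evaluated
# dimension by dimension in the rotated coordinates [Gamma' x]_j.
dmsghd <- function(x, mu, Gamma, Phi, alpha, omega, lambda, log = FALSE) {
  y <- as.vector(t(Gamma) %*% x) - mu        # [Gamma' x]_j - mu_j
  logdens <- 0
  for (j in seq_along(y)) {
    q1 <- omega[j] + y[j]^2 / Phi[j]         # omega_j + Phi_j^{-1}([Gamma' x]_j - mu_j)^2
    q2 <- omega[j] + alpha[j]^2 / Phi[j]     # omega_j + alpha_j^2 Phi_j^{-1}
    logdens <- logdens +
      (lambda[j] - 0.5) / 2 * (log(q1) - log(q2)) +
      log(besselK(sqrt(q1 * q2), lambda[j] - 0.5)) -
      0.5 * log(2 * pi * Phi[j]) - log(besselK(omega[j], lambda[j])) +
      y[j] * alpha[j] / Phi[j]
  }
  if (log) logdens else exp(logdens)
}

# Toy usage with assumed parameter values
eig <- eigen(matrix(c(1, 0.3, 0.3, 1), 2, 2))
dmsghd(c(0.5, 1), mu = c(0, 0), Gamma = eig$vectors, Phi = eig$values,
       alpha = c(0.5, -0.5), omega = c(1, 1), lambda = c(0.5, 0.5))
```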

4 Mixture of Coalesced Generalized Hyperbolic Distributions

Note that the generalized hyperbolic distribution is not a special or limiting case of the MSGHD under any parameterization with p > 1. Motivated by this, consider a coalesced generalized hyperbolic distribution (CGHD) that contains both the generalized hyperbolic distribution and MSGHD as limiting cases. The CGHD arises through the introduction of a random vector

$$ \mathbf{R}=U\mathbf{X}+(1-U)\mathbf{S}, $$
(19)

where X = ΓY, Y ∽GHD (μ, Σ, α, ω0, λ0), Σ = ΓΦΓ′, S ∽MSGHD(μ, Γ, Φ, α, ω, λ), and U is an indicator variable such that

$$U=\left\{\begin{array}{ll} 1 & \text{ if\ } \mathbf{R} \text{ follows a generalized hyperbolic distribution, and}\\ 0 & \text{ if\ } \mathbf{R} \text{ follows a MSGHD.} \end{array}\right.$$

It follows that \(\mathbf {X} = \boldsymbol {\Gamma }\boldsymbol {\mu } + W\boldsymbol {\Gamma }\boldsymbol {\alpha } + \sqrt {W}\boldsymbol {\Gamma }\mathbf {V}\), where ΓV ∼ Np(0, ΓΦΓ′), S = Γμ + ΓΔwα + ΓA, where ΓA ∼ Np(0, ΓΔwΦΓ′), and the density of R can be written

$$ \begin{array}{ll} f_{\text{\tiny{CGHD}}}(\mathbf{r}\mid\boldsymbol{\mu},&\boldsymbol{\Gamma}, \boldsymbol{\Phi},\boldsymbol{\alpha},\boldsymbol{\omega},\boldsymbol{\lambda},\omega_{0},\lambda_{0},\varpi)\\ &= \varpi f_{\text{\tiny{GHD}}}\left( \mathbf{r}\mid\boldsymbol{\mu},\boldsymbol{\Gamma} \boldsymbol{\Phi}\boldsymbol{\Gamma}^{\prime} ,\boldsymbol{\alpha},\omega_{0},\lambda_{0}\right)+ (1-\varpi)f_{\text{\tiny{MSGHD}}}\left( \mathbf{r}\mid\boldsymbol{\mu},\boldsymbol{\Gamma},\boldsymbol{\Phi},\boldsymbol{\alpha},\boldsymbol{\omega},\boldsymbol{\lambda}\right), \end{array} $$
(20)

where fGHD(⋅) is the density of a generalized hyperbolic random variable, fMSGHD(⋅) is the density of a MSGHD random variable, and ϖ ∈ (0, 1) is a mixing proportion. Note that the random vector R would be distributed generalized hyperbolic if ϖ = 1 and would be distributed MSGHD if ϖ = 0. However, we must restrict ϖ ∈ (0, 1) for identifiability reasons. Specifically, if ϖ = 0 then the value of fCGHD(rμ, Γ, Φ, α, ω, λ, ω0, λ0, ϖ) will be the same for any ω0 and λ0 whereas, if ϖ = 1, then the value of fCGHD(rμ, Γ, Φ, α, ω, λ, ω0, λ0, ϖ) will be the same for any ω and λ. The parameters μ, α, Γ, and Φ are the same for both densities, the parameters ω0 and λ0 are univariate values peculiar to the generalized hyperbolic distribution, and the p-dimensional parameters ω and λ are peculiar to the MSGHD. Write R ∽CGHD(μ, Γ, Φ, α, ω, λ, ω0, λ0, ϖ) to indicate that the random vector R follows a CGHD with density in Eq. 20.
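Because the CGHD density in Eq. 20 is simply a ϖ-weighted combination of the two densities, a density evaluator is immediate once evaluators for each part are available; the sketch below assumes hypothetical functions dghd() and dmsghd() (the latter as sketched in Section 3) and is illustrative only.

```r
# Minimal sketch of the CGHD density in Eq. 20, assuming dghd() and dmsghd()
# evaluate the GHD and MSGHD densities (hypothetical helpers, e.g., the
# sketches above).
dcghd <- function(x, mu, Gamma, Phi, alpha, omega, lambda,
                  omega0, lambda0, varpi) {
  Sigma <- Gamma %*% diag(Phi) %*% t(Gamma)        # scale matrix Gamma Phi Gamma'
  varpi * dghd(x, mu, Sigma, alpha, omega0, lambda0) +
    (1 - varpi) * dmsghd(x, mu, Gamma, Phi, alpha, omega, lambda)
}
```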

A mixture of CGHDs (MCGHDs) has density

$$ f(\mathbf{x}\mid\boldsymbol{\vartheta})= \sum\limits_{g = 1}^{G} \pi_{g}f_{\text{\tiny{CGHD}}}\left( \mathbf{x}\mid\boldsymbol{\mu}_{g},\boldsymbol{\Gamma}_{g},\boldsymbol{\Phi}_{g},\boldsymbol{\alpha}_{g},\boldsymbol{\omega}_{g},\boldsymbol{\lambda}_{g},\omega_{0g},\lambda_{0g},\varpi_{g}\right). $$

Parameter estimation can be carried out via a generalized expectation-maximization (GEM) algorithm (Dempster et al. 1977). There are four sources of missing data: the latent w0ig, the multi-dimensional weights Δwig = diag(w1ig, … , wpig), the component membership labels zig, and the inner component labels uig, for i = 1, … , n and g = 1, … , G. As usual, zig = 1 if observation i belongs to component g and zig = 0 otherwise. Similarly, uig = 1 if observation i, in component g, is distributed generalized hyperbolic and uig = 0 if observation i, in component g, is distributed MSGHD. It follows that the complete-data log-likelihood for the MCGHD is

$$ \begin{array}{@{}rcl@{}} l_{\text{c}}& =& \sum\limits_{i = 1}^{n} \sum\limits_{g = 1}^{G} \left\{\vphantom{\sum\limits_{j = 1}^{p}} z_{ig} \log \pi_{g} + z_{ig}{u_{ig}} \log \varpi_{g} + z_{ig}(1 - u_{ig}) \log (1-\varpi_{g}) \right.\\ & &+ z_{ig}{u_{ig}} \log h \left( w_{0ig} | \omega_{0g}, 1, \lambda_{0g} \right)+ z_{ig}(1 - u_{ig}) \sum\limits_{j = 1}^{p}\log h \left( w_{jig} | \omega_{jg}, 1, \lambda_{jg} \right) \\ &&+ z_{ig}{u_{ig}} \log \phi_{p} \left( \boldsymbol{\Gamma}_{g}^{\prime}\mathbf{x}_{i} | \boldsymbol{\mu}_{g} + w_{0ig} \boldsymbol{\alpha}_{g} , w_{0ig} \boldsymbol{\Phi}_{g} \right)\\ &&\left. + z_{ig}(1 - u_{ig}) \sum\limits_{j = 1}^{p} \log \phi_{1}\left( [\boldsymbol{\Gamma}_{g}^{\prime}\mathbf{x}_{i} ]_{j} | \mu_{jg} + w_{jig} \alpha_{jg} , w_{jig} {\Phi}_{jg} \right)\right\}. \end{array} $$

Further details on parameter estimation and on the identifiability of the MCGHD are given in Appendix 1 and Appendix 3, respectively.
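For orientation, the first part of the E-step can be sketched as follows: conditional on the current parameter values, the expected values of the labels zig and uig follow from Bayes' theorem. The R fragment below assumes hypothetical density evaluators dghd() and dmsghd() (e.g., the sketches above) and a list `pars` of current component parameters; it is illustrative only and is not the MixGHD implementation.

```r
# Minimal sketch of the E-step updates for the labels z_ig and u_ig.
estep_labels <- function(X, pi_g, pars) {
  n <- nrow(X); G <- length(pi_g)
  z <- matrix(NA_real_, n, G); u <- matrix(NA_real_, n, G)
  for (g in 1:G) {
    p_g <- pars[[g]]
    f_ghd <- apply(X, 1, function(x)
      dghd(x, p_g$mu, p_g$Gamma %*% diag(p_g$Phi) %*% t(p_g$Gamma),
           p_g$alpha, p_g$omega0, p_g$lambda0))
    f_ms  <- apply(X, 1, function(x)
      dmsghd(x, p_g$mu, p_g$Gamma, p_g$Phi, p_g$alpha, p_g$omega, p_g$lambda))
    f_cg  <- p_g$varpi * f_ghd + (1 - p_g$varpi) * f_ms
    z[, g] <- pi_g[g] * f_cg                  # unnormalized E[z_ig]
    u[, g] <- p_g$varpi * f_ghd / f_cg        # E[u_ig | z_ig = 1]
  }
  z <- z / rowSums(z)                         # normalize over the G components
  list(z = z, u = u)
}
```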

5 Cluster Convexity

The definition of a cluster has been discussed quite extensively in the literature. Hennig (2015) provides a very interesting discussion of some potential characteristics that clusters may have. Even though they cannot always be observed, and in some situations desired characteristics may even conflict, such characteristics provide important links and contrasts between the different definitions that are used for a cluster. In fact, the idea of desirable characteristics is not a new one, e.g., Cormack (1971) gives internal cohesion and external isolation as two “basic ideas” in this direction. In this paper, we follow the definition given by McNicholas (2016a): “a cluster is a unimodal component within an appropriate finite mixture model.” The term appropriate means that the model has the flexibility to fit the data and, in many cases, this also means that each cluster is convex (see McNicholas 2016a, Section 9.1). The development of flexible models outlined herein may make it a little easier to find an “appropriate” component density.

The MSGHD is more flexible than the GHD; however, similar to the multiple scaled multivariate t-distribution of Forbes and Wraith (2014), the MSGHD can have contours that are not convex. Accordingly, the MMSGHD can have components that are non-convex, leading to non-convex clusters. Consider the data in Fig. 1. How many clusters are there? The most plausible answer is two overlapping clusters: one with positive correlation between the variables and another with negative correlation between the variables. There may also be an argument for four or five clusters. Note that the data are generated from a G = 2 component mixture of multivariate t-distributions.

Fig. 1

Scatter plot of data generated from a two-component t-mixture (top-left) along with contours from fitted G = 1 component MMSGHD (top-right), G = 2 component MMSGHD (bottom-left), and G = 2 component McMSGHD (bottom-right) models, where plotting symbol and color represent predicted classifications

The MMSGHD is fitted to these data for G = 1, … , 5 and, as is common in model-based clustering applications, the Bayesian information criterion (BIC; Schwarz 1978) is used to select G. Note that the BIC is given by

$$\text{BIC}= 2l(\hat{\boldsymbol{\vartheta}})-\rho\log n,$$

where \(l(\hat {\boldsymbol {\vartheta }})\) is the maximized log-likelihood, ρ is the number of free parameters, and n is the number of observations. For the MMSGHD, the BIC selects a G = 1 component model (Fig. 1). Furthermore, looking at the G = 2 component MMSGHD solution (Fig. 1) confirms that the problem is not just one of model selection; the G = 2 component MMSGHD solution has one component that is roughly elliptical and another that is not convex. On the other hand, forcing the MSGHD to be convex, which can be done by imposing the constraint λj > 1, for j = 1, … , p, leads to what we call the convex MSGHD (cMSGHD). Fitting the corresponding mixture of cMSGHDs (McMSGHDs) ensures that, if each component is associated with a cluster, then the clusters are convex; for these data, a G = 2 component McMSGHD model is selected (Fig. 1). Formally, the constraint amounts to ensuring that the MSGHD density is quasi-concave (see Appendix 2). The general point here is that, if convexity is not enforced, then the MMSGHD can give components that contain multiple clusters. While it is easy to spot this in two dimensions, e.g., Fig. 1, the phenomenon may go unrecognized in higher dimensions, possibly resulting in greatly misleading results. Of course, the issue of non-convex clusters does not arise with most model-based approaches; however, when multiple scaled mixtures are considered, the issue can crop up. Another example in a similar vein is given in Fig. 2, where data are generated from a G = 3 component mixture of multivariate t-distributions. The selected MMSGHD has G = 2 components, including one clearly non-convex cluster, while the McMSGHD gives sensible clustering results.
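As a minimal sketch of the model selection step described above, the fragment below loops over G, fits the MMSGHD, and computes the BIC as defined above; fit_mmsghd() is a hypothetical wrapper (e.g., around the MSGHD() function in MixGHD) assumed to return the maximized log-likelihood and the number of free parameters.

```r
# Minimal sketch of BIC-based selection of G; fit_mmsghd() is a hypothetical
# fitting wrapper returning `loglik` (maximized log-likelihood) and `npar`
# (number of free parameters).
select_G <- function(X, G_max = 5) {
  bic <- rep(NA_real_, G_max)
  for (G in 1:G_max) {
    fit <- fit_mmsghd(X, G = G)                       # hypothetical fitting call
    bic[G] <- 2 * fit$loglik - fit$npar * log(nrow(X))
  }
  which.max(bic)                                      # BIC as defined above: larger is better
}
```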

Fig. 2

Scatter plots for model-based clustering results on data simulated from a three-component t-mixture (top-left), with contours from the selected MMSGHD (top-right) and McMSGHD (bottom) models, respectively, where plotting symbol and color represent predicted classifications

The intention behind the introduction of the McMSGHD is not that it should supplant the MMSGHD, but rather that it provides a convenient check on the number of components. In a higher dimensional application, where visualization is difficult or impossible, situations where the selected MMSGHD has fewer components than the selected McMSGHD will deserve special attention. Of course, this is not to say that the selected McMSGHD will always have more components in situations where the MMSGHD has too few, but rather that it will help to avoid the sort of situations depicted in Figs. 1 and 2. Note that parameter estimation for the McMSGHD is analogous to that for the MMSGHD, except that λkj (k = 0, 1, … , G, j = 1, … , p) is updated only if the value of its update exceeds 1.
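In code, this constraint can be sketched as a simple guard on the unconstrained update; the fragment below is illustrative only, with lambda_old and lambda_new denoting the current values and the unconstrained M-step updates for one component.

```r
# Minimal sketch of the McMSGHD constraint lambda > 1: accept the
# unconstrained M-step update only if it exceeds 1, otherwise keep the
# current value.  lambda_old and lambda_new are p-vectors for one component.
update_lambda_convex <- function(lambda_old, lambda_new) {
  ifelse(lambda_new > 1, lambda_new, lambda_old)
}
```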

6 Illustrations

6.1 Implementation and Evaluation

In the following illustrations, we fit the MGHDs, the MMSGHDs, the McMSGHDs, and the MCGHDs using the corresponding functions available in the MixGHD package (Tortora et al. 2017) for R (R Core Team 2017). We use k-means and k-medoids clustering to initialize the \(\hat {z}_{ig}\). The adjusted Rand index (ARI; Hubert and Arabie 1985) is used to compare predicted classifications with true classes. The ARI corrects the Rand index (Rand 1971) for chance; its expected value under random classification is 0, and it takes a value of 1 when there is perfect class agreement. Steinley (2004) gives guidelines for interpreting ARI values.
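As a minimal usage sketch, predicted and true labels can be compared as follows; here the adjustedRandIndex() function from the mclust package is used (the MixGHD package also provides an ARI function), and the label vectors are illustrative.

```r
# Minimal sketch: compare predicted classifications with the true classes
# using the ARI; `pred` and `truth` are assumed label vectors of equal length.
library(mclust)

truth <- c(1, 1, 1, 2, 2, 2)
pred  <- c(1, 1, 2, 2, 2, 2)
adjustedRandIndex(pred, truth)   # 1 = perfect agreement, ~0 under random labelling
```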

Several mixtures of skewed distributions have been proposed for model-based clustering and classification. Among them, three different formulations of the multivariate skew-t distribution have been used for clustering in the mixture setting. The formulation used by Murray et al. (2014) is a special case of the generalized hyperbolic distributions. Azzalini et al. (2016) refers to the other two formulations as the classical and SDB formulations (for the initials of the authors’ names), respectively. Lee and McLachlan (2013b, 2014) compare the mixture of classical skew-t distributions to the mixture of SDB skew-t distributions (MSDBST) and to skew-normal analogues of both formulations; their results lead one to believe that the MSDBST is generally preferable (see Azzalini et al. 2016, for an alternative viewpoint). The MSDBST approach is implemented using the EMMIXuskew package (Lee and McLachlan 2013a) for R and it is used for comparison herein.

6.2 Simulation Study

We use a simulation study to measure the performance of the proposed methods. The interest is in observing the ARI under different circumstances. The data are generated using two-component mixtures of Gaussian (Scenario 1), generalized hyperbolic (Scenario 2), and multiple scaled generalized hyperbolic distributions (Scenario 3), using the R function mvrnorm and the stochastic relationships given in Eq. 14 and Section 4. For each scenario, a three-factor full factorial design was used, where the factors are medium (M) or high (H) correlation, 25% or 50% overlap, and same (S), n1 = n2 = 100, or different (D), n1 = 150 and n2 = 50, numbers of observations per component. In all the scenarios, μ1 = 1. Table 1 shows the average ARI, with standard deviation, obtained on 10 data sets generated from a Gaussian mixture with p = 10, G = 2, correlation equal to 0.33 for M and 0.66 for H, and μ2 = 2 for 25% overlap and μ2 = 3 for 50% overlap. Table 2 shows the average ARI, with standard deviation, obtained on 10 data sets generated from an MGHD with p = 10, G = 2, α = 1, λ = 0.5, ω = 1, correlation equal to 0.25 for M and 0.5 for H, and μ2 = 10 for 25% overlap and μ2 = 15 for 50% overlap. Table 3 shows the average ARI, with standard deviation, obtained on 10 data sets generated from an MMSGHD with p = 10, G = 2, α1 = 1, α2 = -1, λ = 0.5, ω = 1, correlation equal to 0.25 for M and 0.35 for H, and μ2 = 19 for 25% overlap and μ2 = 23 for 50% overlap. For the data sets generated using the MGHD and the MMSGHD, the values of the correlation had to be reduced in order to maintain 25% and 50% overlap; higher correlation leads to higher overlap. Figure 3 shows scatter plots of the data simulated using the two-component MGDs, MGHDs, and MMSGHDs with high correlation, 50% overlap, and clusters of different sizes, where plotting symbol and color represent the true classifications.
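A minimal sketch of the data generation step for Scenario 1 is given below; an exchangeable correlation structure and the MASS function mvrnorm are assumed for illustration, with rho, mu2, n1, and n2 varying across the factorial design as described above.

```r
# Minimal sketch of Scenario 1: two Gaussian components in p = 10 dimensions
# with assumed exchangeable correlation rho, means mu1 and mu2, and sizes n1, n2.
library(MASS)

gen_scenario1 <- function(n1 = 100, n2 = 100, p = 10, rho = 0.33,
                          mu1 = 1, mu2 = 2) {
  Sigma <- matrix(rho, p, p); diag(Sigma) <- 1
  X <- rbind(mvrnorm(n1, rep(mu1, p), Sigma),
             mvrnorm(n2, rep(mu2, p), Sigma))
  list(X = X, class = rep(1:2, c(n1, n2)))
}

set.seed(3)
sim <- gen_scenario1()
```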

Fig. 3

Scatter plots for data simulated from two-component MGD (top-left), MGHD (top-right), and MMSGHD (bottom) models, respectively, with high correlation, 50% overlap, and clusters of different sizes, where plotting symbol and color represent the true classifications

Table 1 Average ARI values, with standard deviations, for each data set generated from MGDs, with different starting parameters
Table 2 Average ARI values, with standard deviations, for each data set generated from MGHDs, with different starting parameters and α = 1
Table 3 Average ARI, with standard deviations, for each data set generated from MMSGHDs, with different starting parameters and α = 1

The MCGHDs perform at least comparably to the best of the MMSGHDs and MGHDs. As expected, the MGHD outperforms the MMSGHD when the data are generated from an MGHD, and vice versa when the data are generated from an MMSGHD. The two methods perform similarly on normally distributed clusters. The generated clusters are all convex, which explains the similarity of the results obtained using the MMSGHDs and McMSGHDs. The level of overlap has a notable impact on the ARI, as expected. The performance of all the methods is slightly better when the clusters are of the same size and when the correlation is lower.

6.3 Real Data Analysis

To assess the classification performance of the MGHDs, the MMSGHDs, the McMSGHDs, the MCGHDs, and the MSDBST, we consider four real data sets that are commonly used within the model-based clustering literature (Table 4).

Table 4 Summary details for four real data sets commonly used within the model-based clustering literature

To compare the performance of the methods, we set the number of components, G, equal to the true number of classes. In general, the BIC is used to select the best starting criterion between k-means and k-medoids; however, because of the presence of outliers, only k-medoids starts are used for the bankruptcy data set. Table 5 displays the classification performance. Note that MSDBST is not used on the AIS data set because of the prohibitively high dimensionality (p = 11). The MCGHD generally performs comparably to the best of the other four approaches. Specifically, the MCGHD gives the best classification performance—either outright or jointly—for the four data sets. Interestingly, the MGHD gives very good classification performance on three of the four data sets; however, this must be taken in context with its very poor classification performance on the bankruptcy data set (ARI ≈ 0). It is also interesting to compare the classification performance of the McMSGHD to the MMSGHD as well as that of the MGHD to the MMSGHD. The McMSGHD and the MGHD approaches both outperform MMSGHD for one data set, and give a similar performance on two of the other three data sets. This highlights the fact that a mixture of multiple scaled distributions may well not outperform its single scaled analogue, and underlines the need for an approach with both the MSGHD and MGHD models as special cases. The results for the bankruptcy data illustrate that the MCGHD approach can give very good classification performance in situations where neither the MGHD nor the MMSGHD perform well. Finally, the MCGHD outperforms the MSDBST on two data sets and gives the same result on the third (recall that the MSDBST could not be fitted to the AIS data).

Table 5 ARI values for the MCGHD, MGHD, MMSGHD, McMSGHD, and MSDBST approaches on four real data sets

In this analysis, we took the number of components to be known. However, in a true clustering scenario, we would not have a priori information about the number of groups. Therefore, the analysis was repeated without this assumption and, for all but the bankruptcy data set, all approaches were run for G = 1, … , 5 components. Because of the small number of observations, the bankruptcy data were run for G = 1, 2, 3. Table 6 gives the number of components selected by the BIC.

Table 6 Selected number of components using the BIC for each real data set and corresponding ARI in brackets, where the correct number of components is highlighted in italics

For the bank note data, the BIC selects the correct number of components for every method; however, for the bankruptcy data, it picks the right number of components only for McMSGHDs. For the seed and AIS data sets, the BIC does not select the correct number of components for any approach. Recall that the MSDBST could not be fitted to the AIS data; also, it could not be fitted to the seed data for G = 4. These results illustrate that the BIC is not necessarily reliable for selecting the number of components in real data examples for any of the five approaches.

6.4 Computation Time

For the data sets listed in Table 4, we measured the elapsed user times, in seconds, to perform 100 (G)EM iterations. Note that all code was run in R version 3.0.2 on a 32-core Intel Xeon E5 server with 256GB RAM running 64-bit CentOS. Figure 4 displays the average elapsed time for two replications of each algorithm using a k-means and a k-medoids starting partition for G = 1, … , 5 components. The EM algorithm for the MSDBST is significantly slower than that of the hyperbolic-based approaches (Fig. 4). In fact, on the banknote data set, it takes the MSDBST more than 11 h to perform the required number of EM iterations when G = 2 and about 63 h when G = 5. For the seeds data set, it takes the MSDBST more than 5 h when G = 2 and more than 10 h when G = 5, whereas the G = 5 component MMSGHD, McMSGHD, and MCGHD approaches need less than 25 s. For the bankruptcy data set, the MSDBST requires approximately 80 s when G = 3, whereas the hyperbolic distributions need less than 20 s each.

Fig. 4

The average elapsed time to perform 100 iterations of the (G)EM algorithm when varying the number of components for MCGHD, MGHD, MMSGHD, McMSGHD, and MSDBST models on the bankruptcy, banknote, seeds, and AIS data sets, respectively

7 Discussion

Novel MCGHD, MMSGHD, and McMSGHD models have been introduced and applied for model-based clustering. The GHD is a flexible distribution, capable of handling skewness and heavy tails, and has many well-known distributions as special or limiting cases. Furthermore, it is a normal variance-mean mixture, arising via a relationship between a multivariate Gaussian and a univariate GIG distribution. The MSGHD extends the GHD by allowing a separate GIG weight in each dimension, increasing the flexibility of the model. However, the GHD is not a special case of the MSGHD; hence, we introduced the CGHD, which has both the GHD and the MSGHD as limiting cases. The McMSGHD approach was introduced as a convex version of the MMSGHD, and this point deserves some further discussion. The extension of the multivariate-t distribution to the multiple scaled setting was carried out by Forbes and Wraith (2014); as discussed in Appendix 2, the multiple scaled multivariate t-distribution cannot be quasi-concave, i.e., the clusters associated with a mixture of multiple scaled multivariate t-distributions cannot be convex. We have seen examples where the MMSGHD can put multiple clusters into one component, and the McMSGHD has an important role in helping to prevent this; if both approaches are fitted and lead to different numbers of components, then further attention is warranted.

Comparing the MGHD, MMSGHD, McMSGHD, and MCGHD approaches yielded some interesting results. Among them, we see that the MMSGHD does not necessarily outperform the MGHD; in fact, the MGHD approach gave better clustering performance than the MMSGHD approach on one of the four real data sets we considered, as well as similar performance on two of the remaining three. This underlines the fact that a mixture of multiple scaled distributions may well not outperform its single scaled analogue, and highlights the benefit of approaches with both a multiple scaled distribution and its single scaled analogue as limiting cases. The MCGHD represents one such approach, with the MSGHD and MGHD models as limiting cases. The approaches introduced herein, as well as the MGHD, have been made freely available via the MixGHD package for R.

Future work will proceed in several directions. For one, it will be interesting to study the performance of the approaches introduced herein within the fractionally supervised classification framework (Vrbik and McNicholas 2015; Gallaugher and McNicholas 2019a). The extension of these approaches to the matrix-variate paradigm is also of interest and may proceed in an analogous fashion to the work of Gallaugher and McNicholas (2018, 2019b). Finally, the models introduced herein can be extended to account for missing data (see Wei et al. 2019).