1 Introduction

An underlying principle of many statistical methods is to find a useful summarization of a distribution (empirical or otherwise). A common approach in the realm of classification is to determine a set of points that optimally represent a distribution (e.g., cluster analysis). Mean squared error is perhaps the most common optimality criterion in practice. The set of k points that optimally represent a distribution in terms of mean squared error is called a set of k-principal points. This paper presents results on optimal estimators of principal points for a wide class of multivariate distributions, namely multivariate distributions of location-scale and location-scale-rotation families.

Formally, suppose that \(\varvec{X}\) is a p-dimensional random vector with finite second moments. The k-principal points (Flury 1990) of the distribution of \(\varvec{X}\) are defined to be a set \(\{ \varvec{\gamma }_{1}^{*},\dots ,\varvec{\gamma }_{k}^{*}\}\) that minimizes the mean squared distance

$$\begin{aligned} MSD(\varvec{\gamma }_{1},\dots ,\varvec{\gamma }_{k}) \equiv E[\min _{j=1,\dots ,k}\Vert \varvec{X}-\varvec{\gamma }_{j} \Vert ^2 ]. \end{aligned}$$
(1.1)

The 1-principal point is the mean \(E[\varvec{X}]\) and represents the optimal 1-point summarization of a distribution in terms of mean squared error. A set of k-principal points with \(k\ge 2\) can be seen as a generalization of the mean of the distribution from one point to multiple points.
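Since a set of k-principal points is, in effect, the set of optimal cluster means of the distribution itself, it can be approximated numerically by Lloyd's (k-means) algorithm applied to a large Monte Carlo sample. The following Python sketch is purely illustrative (the helper names msd and principal_points are ours, not from the literature): msd approximates the criterion (1.1) for a given set of points, and principal_points searches for a minimizing set.

```python
import numpy as np

def msd(points, sample):
    # Monte Carlo approximation of MSD (1.1): average squared distance
    # from each observation in `sample` to its nearest point in `points`.
    d2 = ((sample[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def principal_points(sample, k, n_iter=100, seed=0):
    # Lloyd's algorithm: alternately assign each observation to its nearest
    # representative point and move each point to the mean of its assigned
    # observations.  Run on a large sample, the result approximates a set of
    # k-principal points (a local minimizer of (1.1)) of the sampled distribution.
    rng = np.random.default_rng(seed)
    pts = sample[rng.choice(len(sample), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((sample[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                pts[j] = sample[labels == j].mean(axis=0)
    return pts
```

For k = 1 the returned point is simply the sample mean, consistent with the remark above.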

Theoretical aspects and applications of principal points of univariate and multivariate distributions have been studied in many papers. Tarpey et al. (1995) showed that when \(\{ \varvec{\gamma }_{1}^{*},\dots ,\varvec{\gamma }_{k}^{*}\}\) is a set of k-principal points of the distribution of \(\varvec{X}\), it holds for any \(a\in \mathfrak {R}\), \(\varvec{b}\in \mathfrak {R}^p\), and \(\varvec{G}\in O_p\) that \(\{ a\varvec{G}'\varvec{\gamma }_{1}^{*}+\varvec{b},\dots ,a\varvec{G}'\varvec{\gamma }_{k}^{*}+\varvec{b} \}\) is a set of k-principal points of the distribution of \(a\varvec{G}'\varvec{X}+\varvec{b}\), where \(O_p\) denotes the set of \(p\times p\) orthogonal matrices. Patterns of principal points and their symmetry and uniqueness have been discussed extensively in the literature, e.g., Trushkin (1982), Li and Flury (1995), Tarpey (1995), Tarpey (1998), Yamamoto and Shinozaki (2000a), Gu and Mathew (2001), and Mease and Nair (2006). Connections between principal points and principal subspaces (i.e., subspaces spanned by the first several principal eigenvectors of the covariance matrix) have also been investigated (Tarpey et al. 1995; Yamamoto and Shinozaki 2000b; Kurata 2008; Kurata and Qiu 2011; Matsuura and Kurata 2010, 2011, 2014; Tarpey and Loperfido 2015). Principal points have been studied mainly for continuous distributions, but also for discrete distributions (Gu and Mathew 2001; Yamashita et al. 2017; Yamashita and Goto 2017) and for random functions (Tarpey and Kinateder 2003; Shimizu and Mizuta 2007; Bali and Boente 2009). Applications of principal points to classification in medicine include distinguishing between drug and placebo responses (Tarpey et al. 2003, 2010; Petkova and Tarpey 2009; Tarpey and Petkova 2010). Ruwet and Haesbroeck (2013) studied the performance of classification using 2-principal points. Vector quantization in the signal processing literature (Gersho and Gray 1992; Graf and Luschgy 2000) is closely related to principal points, with many common mathematical connections.

In addition to the theoretical properties, applications of principal points typically require methods of estimating principal points. If \(\varvec{X}\) is assumed to belong to some parametric family, we may estimate the parameters of \(\varvec{X}\) from a random sample and use these parameter estimates to obtain estimates of the k-principal points of \(\varvec{X}\). A simple alternative approach is to estimate k-principal points by computing the k-principal points of the empirical distribution derived from the random sample, which gives a nonparametric estimator of k-principal points. However, as others have shown (Flury 1993; Tarpey 1997, 2007; Stampfer and Stadlober 2002), the maximum likelihood estimator of k-principal points (i.e., the k-principal points of the distribution with the unknown parameters replaced by their maximum likelihood estimates obtained from the random sample) performs better than the nonparametric estimator.

Recently, as an alternative to the maximum likelihood estimator, Matsuura et al. (2015) proposed a minimum expected mean squared distance (EMSD) estimator of k-principal points that minimizes

$$\begin{aligned} \text{ EMSD } =E[MSD(\hat{\varvec{\gamma }}_{1},\dots ,\hat{\varvec{\gamma }}_{k})] =E\left[ \min _{j=1,\dots ,k}\Vert \varvec{X}-\hat{\varvec{\gamma }}_{j} \Vert ^2 \right] , \end{aligned}$$
(1.2)

where \(\{ \hat{\varvec{\gamma }}_{1},\dots ,\hat{\varvec{\gamma }}_{k} \}\) denotes an estimator of k-principal points of the distribution of \(\varvec{X}\).

In point estimation, optimality is typically assessed with a loss function that measures how “close” parameter estimates are to the true parameter values, e.g., a mean squared error loss of the form \(E[\Vert \hat{\varvec{\theta }}-\varvec{\theta } \Vert ^2]\), where \(\varvec{\theta }\) is a parameter and \(\hat{\varvec{\theta }}\) is an estimator of \(\varvec{\theta }\). However, in terms of estimating a set of points that best represents the distribution generating the data, a natural criterion, one that we consider in this paper as well as in Matsuura et al. (2015), is the EMSD (1.2), which measures the expected squared distance between the point estimates and the random vector.
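Because the EMSD (1.2) averages over both the training sample (through the estimator) and an independent copy of \(\varvec{X}\), it rarely has a closed form and is conveniently approximated by nested simulation. The sketch below is a minimal illustration under stated assumptions: sampler(rng, m) returns an m \(\times \) p array of draws from the distribution of \(\varvec{X}\), estimator(train, k) returns a k \(\times \) p array of point estimates, and the function name emsd_mc is ours.

```python
import numpy as np

def emsd_mc(estimator, sampler, n, k, n_rep=2000, n_eval=4000, seed=0):
    # Monte Carlo approximation of EMSD (1.2): repeatedly draw a training
    # sample of size n, form the estimates gamma_hat_1,...,gamma_hat_k, and
    # average the squared distance from independent copies of X to the
    # nearest estimate.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rep):
        train = sampler(rng, n)          # n x p training sample
        gammas = estimator(train, k)     # k x p estimated points
        x = sampler(rng, n_eval)         # fresh draws of X
        d2 = ((x[:, None, :] - gammas[None, :, :]) ** 2).sum(axis=2)
        total += d2.min(axis=1).mean()
    return total / n_rep
```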

Matsuura et al. (2015) studied the minimum EMSD estimation of principal points, mainly in the case of univariate distributions of location-scale families. Although they discussed an extension to the multivariate case, their results are restricted to the case where principal points are collinear (i.e., principal points are on a line).

This paper presents a minimum EMSD estimator of k-principal points of multivariate distributions of location-scale families. In particular, we show that a minimum EMSD estimator of k-principal points of a multivariate normal distribution is obtained by using k-principal points of a scaled multivariate t-distribution, which corresponds to the result on the univariate case given by Matsuura et al. (2015). In addition, we give an extension to the case of multivariate distributions of location-scale-rotation families. Several numerical examples are also presented.

2 Minimum EMSD estimator of principal points for multivariate location-scale families

Suppose that the distribution of \(\varvec{X}\) belongs to a location-scale family: its probability density function (pdf) is of the form

$$\begin{aligned} f(\varvec{x};\varvec{\mu },\sigma ) = \frac{1}{\sigma ^p} g\left( \frac{\varvec{x}-\varvec{\mu }}{\sigma } \right) \ \text{ for } \text{ some } \text{ pdf } \ g:\mathfrak {R}^p \rightarrow [0,\infty ), \end{aligned}$$
(2.1)

where \(\varvec{\mu }\in \mathfrak {R}^p\) is a location parameter and \(\sigma >0\) is a scale parameter.

Let \(\varvec{\delta }_{1}^{*},\dots ,\varvec{\delta }_{k}^{*}\) be k-principal points of the distribution with pdf \(g(\varvec{x})\) (i.e., the distribution of \(\frac{\varvec{X}-\varvec{\mu }}{\sigma }\)). Then, k-principal points \(\varvec{\gamma }_{1}^{*},\dots ,\varvec{\gamma }_{k}^{*}\) of the distribution of \(\varvec{X}\) are given by

$$\begin{aligned} \varvec{\gamma }_{j}^{*}=\varvec{\mu } + \sigma \varvec{\delta }_{j}^{*}, \quad j=1,\dots ,k. \end{aligned}$$
(2.2)

Suppose that the values of \(\varvec{\mu }\) and \(\sigma \) are unknown and the function g is known (hence, the covariance matrix of \(\varvec{X}\) is known up to a multiplicative constant). Let \(\varvec{X}_1,\dots ,\varvec{X}_n\) be a random sample of size n drawn from the distribution of \(\varvec{X}\), and let \(\hat{\varvec{\mu }}\) and \(\hat{\sigma }\) be the maximum likelihood estimators of \(\varvec{\mu }\) and \(\sigma \) obtained from the sample \(\varvec{X}_1,\dots ,\varvec{X}_n\), respectively. We assume the existence of the maximum likelihood estimators throughout this paper.

Since k-principal points of the distribution of \(\varvec{X}\) are expressed as (2.2), we consider a principal points estimator of the following form:

$$\begin{aligned} \hat{\varvec{\gamma }}_{j}=\hat{\varvec{\mu }} + \hat{\sigma } \varvec{a}_{j}, \quad j=1,\dots ,k. \end{aligned}$$

If we set \(\varvec{a}_{j}=\varvec{\delta }_{j}^{*}, \ j=1,\dots ,k\), then \(\{ \hat{\varvec{\gamma }}_{1},\dots ,\hat{\varvec{\gamma }}_{k} \}\) is the maximum likelihood estimator of k-principal points of the distribution of \(\varvec{X}\).

Here, we pursue an alternative aim: to derive the best set of values \(\{ \varvec{a}_{1},\dots ,\varvec{a}_{k} \}\) minimizing the EMSD (1.2)

$$\begin{aligned} \text{ EMSD } =E[MSD(\hat{\varvec{\gamma }}_{1},\dots ,\hat{\varvec{\gamma }}_{k})] =E[MSD(\hat{\varvec{\mu }} + \hat{\sigma }\varvec{a}_1,\dots ,\hat{\varvec{\mu }} + \hat{\sigma }\varvec{a}_k)]. \end{aligned}$$
(2.3)

As preparation, let \(\varvec{X}\) and \(\varvec{X}_1,\dots ,\varvec{X}_n\) be independent and identically distributed with pdf (2.1), and define \(\varvec{U}\equiv \frac{\varvec{X}-\hat{\varvec{\mu }}}{\sigma }\) and \(W \equiv \frac{\hat{\sigma }}{\sigma }\). A straightforward extension of Antle and Bain (1969) shows that the joint distribution of (\(\varvec{U},W\)) does not depend on the unknown parameters \(\varvec{\mu }\) and \(\sigma \). We also let \(\psi (\varvec{u}|W)\) denote the pdf of \(\varvec{U}|W\) (the pdf of the conditional distribution of \(\varvec{U}\) given W). Then, the following theorem presents a minimum EMSD estimator of k-principal points of the distribution of \(\varvec{X}\).

Theorem 1

Let

$$\begin{aligned} h(\varvec{z})=\frac{E[ W^{p+2} \psi (\varvec{z}W|W) ]}{E[W^{2}]}. \end{aligned}$$
(2.4)

Then, the values of \(\{ \varvec{a}_1,\dots ,\varvec{a}_k\}\) minimizing (2.3) are given by k-principal points of the distribution with pdf (2.4).

Proof

First, we can easily show that the function (2.4) is a pdf: in fact,

$$\begin{aligned} \int _{\mathfrak {R}^p}h(\varvec{z})d\varvec{z}= & {} \int _{\mathfrak {R}^p}\frac{E[ W^{p+2} \psi (\varvec{z}W|W) ]}{E[W^{2}]}d\varvec{z} \\= & {} \frac{1}{E[W^{2}]}E\left[ W^{p+2} \int _{\mathfrak {R}^p}\psi (\varvec{z}W|W) d\varvec{z}\right] \\= & {} \frac{1}{E[W^{2}]}E\left[ W^{p+2} \int _{\mathfrak {R}^p}\frac{1}{W^p}\psi (\varvec{x}|W) d\varvec{x}\right] \\= & {} \frac{1}{E[W^{2}]}E\left[ W^2 \int _{\mathfrak {R}^p}\psi (\varvec{x}|W) d\varvec{x}\right] \\= & {} \frac{1}{E[W^{2}]}E\left[ W^2 \right] \\= & {} 1. \end{aligned}$$

Then, it follows that

$$\begin{aligned} E[MSD(\hat{\varvec{\mu }} + \hat{\sigma }\varvec{a}_1,\dots ,\hat{\varvec{\mu }} + \hat{\sigma }\varvec{a}_k)]= & {} E[ \min _{j=1,\dots ,k} \Vert \varvec{X}-\hat{\varvec{\mu }} - \hat{\sigma }\varvec{a}_j \Vert ^2 ] \\= & {} \sigma ^{2} E\left[ \min _{j=1,\dots ,k} \left\| \frac{\varvec{X}-\hat{\varvec{\mu }}}{\sigma } - \frac{\hat{\sigma }}{\sigma }\varvec{a}_j \right\| ^2 \right] \\= & {} \sigma ^{2} E[ \min _{j=1,\dots ,k} \Vert \varvec{U} - W\varvec{a}_j \Vert ^2 ] \\= & {} \sigma ^{2} E[ \ E[ \min _{j=1,\dots ,k} \Vert \varvec{U} - W\varvec{a}_j \Vert ^2 |W ] \ ] \\= & {} \sigma ^{2} E\left[ W^2 E\left[ \min _{j=1,\dots ,k} \left\| \frac{\varvec{U}}{W} - \varvec{a}_j \right\| ^2 \biggm | W \right] \ \right] \\= & {} \sigma ^{2} E\left[ W^2 \int _{\mathfrak {R}^p}\min _{j=1,\dots ,k} \left\| \varvec{z} - \varvec{a}_j \right\| ^2 W^p\psi (\varvec{z}W|W) d\varvec{z} \right] \\= & {} \sigma ^{2} \int _{\mathfrak {R}^p}\min _{j=1,\dots ,k} \left\| \varvec{z} - \varvec{a}_j \right\| ^2 E\left[ W^{p+2}\psi (\varvec{z}W|W) \right] d\varvec{z} \\= & {} \sigma ^{2} E[W^{2}] \int _{\mathfrak {R}^p}\min _{j=1,\dots ,k} \left\| \varvec{z} - \varvec{a}_j \right\| ^2 h(\varvec{z})d\varvec{z} \\= & {} \sigma ^{2} E[W^{2}] E[ \min _{j=1,\dots ,k} \Vert \varvec{Z}-\varvec{a}_j\Vert ^2 ], \end{aligned}$$

where in the last line we let \(\varvec{Z}\) be a p-dimensional random vector with pdf (2.4). Hence, we see that the values of \(\{ \varvec{a}_1,\dots ,\varvec{a}_k \}\) minimizing (2.3) are given by k-principal points of the distribution with pdf (2.4). \(\square \)

We note that since the joint distribution of \(\varvec{U}\equiv \frac{\varvec{X}-\hat{\varvec{\mu }}}{\sigma }\) and \(W \equiv \frac{\hat{\sigma }}{\sigma }\) does not depend on the unknown parameters \(\varvec{\mu }\) and \(\sigma \), the k-principal points of the distribution with pdf (2.4) also do not depend on \(\varvec{\mu }\) and \(\sigma \). This theorem extends Matsuura et al. (2015, Theorem 3), which gave a minimum EMSD estimator of principal points of univariate distributions of location-scale families. In fact, if \(p=1\), then \(h(\varvec{z})=\frac{E[ W^{p+2} \psi (\varvec{z}W|W) ]}{E[W^{2}]}\) reduces to the equivalent form of h(z) given in Matsuura et al. (2015, Theorem 3).
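Outside special cases such as the normal example below, the pdf (2.4) has no convenient closed form, but Theorem 1 can still be applied by simulation. Since the joint distribution of (\(\varvec{U},W\)) does not depend on \(\varvec{\mu }\) and \(\sigma \), one may simulate with \(\varvec{\mu }=\varvec{0}\) and \(\sigma =1\); as the proof shows, (2.4) is the \(W^2\)-weighted mixture of the conditional densities of \(\varvec{Z}=\varvec{U}/W\) given W, so resampling simulated values of \(\varvec{Z}\) with weights proportional to \(W^2\) and running Lloyd's algorithm targets the required distribution. The following sketch assumes user-supplied callables g_sampler(rng, m) (draws from the pdf g) and mle(train) (returning \(\hat{\varvec{\mu }}\) and \(\hat{\sigma }\)); all names are ours.

```python
import numpy as np

def min_emsd_a_points(g_sampler, mle, n, k, p, n_mc=50_000, seed=0):
    # Approximate the optimal a_1,...,a_k of Theorem 1, i.e. k-principal
    # points of the distribution with pdf (2.4), by simulation with
    # mu = 0 and sigma = 1.
    rng = np.random.default_rng(seed)
    Z = np.empty((n_mc, p))
    w2 = np.empty(n_mc)
    for r in range(n_mc):
        x = g_sampler(rng, 1)[0]            # one draw of X
        train = g_sampler(rng, n)           # training sample of size n
        mu_hat, sigma_hat = mle(train)
        Z[r] = (x - mu_hat) / sigma_hat     # Z = U / W
        w2[r] = sigma_hat ** 2              # W^2 (since sigma = 1)
    idx = rng.choice(n_mc, size=n_mc, p=w2 / w2.sum())   # resample by W^2
    return principal_points(Z[idx], k)      # Lloyd's sketch from Sect. 1
```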

Example

(Multivariate normal distribution). Let \(\varvec{X}\) be distributed as \(N(\varvec{\mu },\sigma ^{2}\varvec{\Psi })\), where \(\varvec{\mu }\) and \(\sigma \) are unknown and \(\varvec{\Psi }\) is known. Let \(\varvec{X}_1,\dots ,\varvec{X}_n\) be a random sample of size n drawn from \(N(\varvec{\mu },\sigma ^{2}\varvec{\Psi })\). Then, the maximum likelihood estimators of \(\varvec{\mu }\) and \(\sigma \) are given as follows:

$$\begin{aligned} \hat{\varvec{\mu }}= & {} \bar{\varvec{X}}={1\over n}\sum _{i=1}^{n}{\varvec{X}_i},\\ \hat{\sigma }= & {} \sqrt{\frac{\sum _{i=1}^{n}(\varvec{X}_i-\bar{\varvec{X}})' \varvec{\Psi }^{-1}(\varvec{X}_i-\bar{\varvec{X}})}{np}}. \end{aligned}$$
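In code, these estimators are immediate to compute; the following sketch (function name ours) assumes X is an n \(\times \) p array of observations and Psi a known positive definite matrix.

```python
import numpy as np

def mle_location_scale_normal(X, Psi):
    # Maximum likelihood estimators of mu and sigma for N(mu, sigma^2 Psi)
    # with Psi known, following the formulas above.
    n, p = X.shape
    mu_hat = X.mean(axis=0)
    R = X - mu_hat
    quad = np.einsum('ij,jk,ik->', R, np.linalg.inv(Psi), R)
    sigma_hat = np.sqrt(quad / (n * p))
    return mu_hat, sigma_hat
```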

Let

$$\begin{aligned} \hat{\varvec{\gamma }}_{j} = \hat{\varvec{\mu }}+\hat{\sigma }\varvec{a}_j, \quad j=1,\dots ,k. \end{aligned}$$

Then, we see from the following proposition and Theorem 1 that the values of \(\{ \varvec{a}_1,\dots ,\varvec{a}_k \}\) minimizing (2.3) are given by k-principal points of \(\sqrt{\frac{(n+1)p}{(n-1)p+2}}\varvec{t}_{(n-1)p+2}(\varvec{0},\varvec{\Psi })\), where \(\varvec{t}_{\nu }(\varvec{0},\varvec{\Psi })\) is the multivariate t-distribution with \(\nu \) degrees of freedom, location vector \(\varvec{0}\), and scale matrix \(\varvec{\Psi }\) with pdf

$$\begin{aligned} f_t(\varvec{t}) = \frac{\Gamma (\frac{\nu +p}{2})}{\Gamma (\frac{\nu }{2})\nu ^{\frac{p}{2}}\pi ^{\frac{p}{2}}} |\varvec{\Psi }|^{-\frac{1}{2}} \left( 1+\frac{\varvec{t}'\varvec{\Psi }^{-1}\varvec{t}}{\nu }\right) ^{-\frac{\nu +p}{2}}. \end{aligned}$$

Proposition 2

Let \(\varvec{X}\) and \(\varvec{X}_1,\dots ,\varvec{X}_n\) be independent and identically distributed as \(N(\varvec{\mu },\sigma ^{2}\varvec{\Psi })\). Let \(\hat{\varvec{\mu }}=\bar{\varvec{X}}={1\over n}\sum _{i=1}^{n}{\varvec{X}_i}\) and \(\hat{\sigma }=\sqrt{\frac{\sum _{i=1}^{n}(\varvec{X}_i-\bar{\varvec{X}})'\varvec{\Psi }^{-1}(\varvec{X}_i-\bar{\varvec{X}})}{np}}\). Define \(\varvec{U}\equiv \frac{\varvec{X}-\hat{\varvec{\mu }}}{\sigma }\) and \(W \equiv \frac{\hat{\sigma }}{\sigma }\). Let \(\psi (\varvec{u}|W)\) be the pdf of \(\varvec{U}|W\). Then, the function (2.4) is the pdf of \(\sqrt{\frac{(n+1)p}{(n-1)p+2}}\varvec{t}_{(n-1)p+2}(\varvec{0},\varvec{\Psi })\).

Proof

It is well known that \(\hat{\varvec{\mu }} \sim N(\varvec{\mu },\frac{\sigma ^{2}}{n}\varvec{\Psi })\), \(\frac{np\hat{\sigma }^2}{\sigma ^{2}} \sim \chi ^2((n-1)p)\), and \(\hat{\varvec{\mu }}\) and \(\hat{\sigma }\) are independent. Hence, it holds that \(\varvec{U} \sim N(\varvec{0},\frac{n+1}{n}\varvec{\Psi })\), \(\sqrt{np}W \sim \chi ((n-1)p)\), and \(\varvec{U}\) and W are independent. We see that \(\varvec{U}|W \sim N(\varvec{0},\frac{n+1}{n}\varvec{\Psi })\) and

$$\begin{aligned} \psi (\varvec{u}|W) = (2\pi )^{-\frac{p}{2}} \left( \frac{n+1}{n} \right) ^{-\frac{p}{2}} |\varvec{\Psi }|^{-\frac{1}{2}} \exp \left\{ -\frac{n}{2(n+1)}\varvec{u}'\varvec{\Psi }^{-1}\varvec{u} \right\} . \end{aligned}$$

We also see that \(E[W^2] = \frac{(n-1)p}{np}=\frac{n-1}{n}\) and the pdf of \(\sqrt{np}W\) is

$$\begin{aligned} \kappa (y) = \frac{2^{1-\frac{(n-1)p}{2}}}{\Gamma (\frac{(n-1)p}{2})} y^{(n-1)p-1}\exp \left\{ -\frac{1}{2}y^2\right\} . \end{aligned}$$

Hence,

$$\begin{aligned} h(\varvec{z})= & {} \frac{E[ W^{p+2} \psi (\varvec{z}W|W) ]}{E[W^2]} \nonumber \\= & {} \frac{n}{n-1}E\left[ W^{p+2} (2\pi )^{-\frac{p}{2}} \left( \frac{n+1}{n} \right) ^{-\frac{p}{2}} |\varvec{\Psi }|^{-\frac{1}{2}} \exp \left\{ -\frac{n}{2(n+1)}W^2\varvec{z}'\varvec{\Psi }^{-1}\varvec{z} \right\} \right] \nonumber \\= & {} \frac{1}{(2\pi )^{\frac{p}{2}}(n-1)(n+1)^{\frac{p}{2}}p^{\frac{p+2}{2}}}| \varvec{\Psi }|^{-\frac{1}{2}}\nonumber \\&\times \, E\left[ (\sqrt{np}W)^{p+2} \exp \left\{ -\frac{1}{2(n+1)p}(\sqrt{np}W)^2\varvec{z}'\varvec{\Psi }^{-1}\varvec{z} \right\} \right] \nonumber \\= & {} \frac{1}{(2\pi )^{\frac{p}{2}}(n-1)(n+1)^{\frac{p}{2}}p^{\frac{p+2}{2}}} |\varvec{\Psi }|^{-\frac{1}{2}} \nonumber \\&\ \int _{0}^{\infty }y^{p+2} \exp \left\{ -\frac{1}{2(n+1)p}y^2\varvec{z}'\varvec{\Psi }^{-1}\varvec{z} \right\} \frac{2^{1-\frac{(n-1)p}{2}}}{\Gamma (\frac{(n-1)p}{2})} y^{(n-1)p-1}\exp \left\{ -\frac{1}{2}y^2\right\} dy \nonumber \\= & {} \frac{2^{1-\frac{np}{2}}}{\Gamma (\frac{(n-1)p}{2})\pi ^{\frac{p}{2}}(n-1)(n+1)^{\frac{p}{2}}p^{\frac{p+2}{2}}} |\varvec{\Psi }|^{-\frac{1}{2}} \int _{0}^{\infty } y^{np+1} \exp \left\{ -\frac{1}{2}y^2\left( 1+\frac{\varvec{z}'\varvec{\Psi }^{-1}\varvec{z}}{(n+1)p}\right) \right\} dy \nonumber \\= & {} \frac{2^{1-\frac{np}{2}}}{\Gamma (\frac{(n-1)p}{2})\pi ^{\frac{p}{2}}(n-1)(n+1)^{\frac{p}{2}}p^{\frac{p+2}{2}}} |\varvec{\Psi }|^{-\frac{1}{2}} \frac{\Gamma (\frac{np+2}{2})}{2^{1-\frac{np+2}{2}}} \frac{2^{1-\frac{np+2}{2}}}{\Gamma (\frac{np+2}{2})} \nonumber \\&\ \int _{0}^{\infty } \left\{ t \left( 1+\frac{\varvec{z}'\varvec{\Psi }^{-1}\varvec{z}}{(n+1)p}\right) ^{-\frac{1}{2}} \right\} ^{np+1} \exp \left\{ -\frac{1}{2}t^2 \right\} \left( 1+\frac{\varvec{z}'\varvec{\Psi }^{-1}\varvec{z}}{(n+1)p}\right) ^{-\frac{1}{2}} dt \nonumber \\= & {} \frac{2\cdot \Gamma (\frac{np+2}{2})}{\Gamma (\frac{(n-1)p}{2})\pi ^{\frac{p}{2}}(n-1)(n+1)^{\frac{p}{2}}p^{\frac{p+2}{2}}} |\varvec{\Psi }|^{-\frac{1}{2}} \left( 1+\frac{\varvec{z}'\varvec{\Psi }^{-1}\varvec{z}}{(n+1)p}\right) ^{-\frac{np+2}{2}} \nonumber \\&\ \int _{0}^{\infty } \frac{2^{1-\frac{np+2}{2}}}{\Gamma (\frac{np+2}{2})}t^{np+1} \exp \left\{ -\frac{1}{2}t^2 \right\} dt \nonumber \\= & {} \frac{\Gamma (\frac{np+2}{2})}{\Gamma (\frac{(n-1)p+2}{2}) \pi ^{\frac{p}{2}}(n+1)^{\frac{p}{2}}p^{\frac{p}{2}}} |\varvec{\Psi }|^{-\frac{1}{2}} \left( 1+\frac{\varvec{z}'\varvec{\Psi }^{-1}\varvec{z}}{(n+1)p}\right) ^{-\frac{np+2}{2}}. \end{aligned}$$
(2.5)

We note that the pdf of \(\varvec{t}_{(n-1)p+2}(\varvec{0},\varvec{\Psi })\) is

$$\begin{aligned} g_1(\varvec{t}) = \frac{\Gamma (\frac{np+2}{2})}{\Gamma (\frac{(n-1)p+2}{2})\{(n-1)p+2\}^{\frac{p}{2}}\pi ^{\frac{p}{2}}} |\varvec{\Psi }|^{-\frac{1}{2}} \left( 1+\frac{\varvec{t}'\varvec{\Psi }^{-1}\varvec{t}}{(n-1)p+2}\right) ^{-\frac{np+2}{2}}, \end{aligned}$$

and the pdf of \(\sqrt{\frac{(n+1)p}{(n-1)p+2}}\varvec{t}_{(n-1)p+2}(\varvec{0},\varvec{\Psi })\) is given as

$$\begin{aligned} g_2(\varvec{z})= & {} \left( \sqrt{\frac{(n-1)p+2}{(n+1)p}} \right) ^{p} g_1\left( \varvec{z}\sqrt{\frac{(n-1)p+2}{(n+1)p}}\right) \\= & {} \frac{\{ (n-1)p+2 \}^{\frac{p}{2}}}{\{ (n+1)p \}^{\frac{p}{2}}} \frac{\Gamma (\frac{np+2}{2})}{\Gamma (\frac{(n-1)p+2}{2})\{(n-1)p+2\}^{\frac{p}{2}}\pi ^{\frac{p}{2}}} |\varvec{\Psi }|^{-\frac{1}{2}} \left( 1+\frac{\varvec{z}'\varvec{\Psi }^{-1}\varvec{z}}{(n+1)p}\right) ^{-\frac{np+2}{2}}\\= & {} \frac{\Gamma (\frac{np+2}{2})}{\Gamma (\frac{(n-1)p+2}{2})\pi ^{\frac{p}{2}}\{ (n+1)p \}^{\frac{p}{2}}} |\varvec{\Psi }|^{-\frac{1}{2}} \left( 1+\frac{\varvec{z}'\varvec{\Psi }^{-1}\varvec{z}}{(n+1)p}\right) ^{-\frac{np+2}{2}}. \end{aligned}$$

Thus, we see that (2.5) is the pdf of \(\sqrt{\frac{(n+1)p}{(n-1)p+2}}\varvec{t}_{(n-1)p+2}(\varvec{0},\varvec{\Psi })\). \(\square \)

Since the covariance matrix of \(\varvec{t}_{\nu }(\varvec{0},\varvec{\Psi })\) is \(\frac{\nu }{\nu -2}\varvec{\Psi }\), we see that, for any dimension p of the distribution, the minimum EMSD estimator is based on k-principal points of a scaled multivariate t-distribution with mean \(\varvec{0}\) and covariance matrix \(\frac{n+1}{n-1}\varvec{\Psi }\). We also see that, as the sample size n goes to infinity, the points \(\varvec{a}_j\) of the minimum EMSD estimator tend to the k-principal points of \(N(\varvec{0},\varvec{\Psi })\), so the minimum EMSD estimator and the maximum likelihood estimator coincide asymptotically. These results correspond to those of Matsuura et al. (2015, Section 4.2), which gave a minimum EMSD estimator of the best k collinear points.
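Numerically, the optimal \(\varvec{a}_j\) for the normal case can be approximated by drawing a large sample from the scaled multivariate t-distribution (a normal draw rescaled by an independent chi-square variable) and running Lloyd's algorithm on it. The sketch below reuses the principal_points helper from Sect. 1; the function names are ours and the Monte Carlo sample size n_mc is arbitrary.

```python
import numpy as np

def sample_scaled_mvt(rng, size, nu, Psi, scale):
    # Draws from scale * t_nu(0, Psi) via T = Z * sqrt(nu / S),
    # with Z ~ N(0, Psi) and S ~ chi^2(nu) independent.
    p = Psi.shape[0]
    z = rng.multivariate_normal(np.zeros(p), Psi, size=size)
    s = rng.chisquare(nu, size=size)
    return scale * z * np.sqrt(nu / s)[:, None]

def min_emsd_points_normal(mu_hat, sigma_hat, Psi, n, k, n_mc=200_000, seed=0):
    # Minimum EMSD estimator for N(mu, sigma^2 Psi) with Psi known:
    # gamma_hat_j = mu_hat + sigma_hat * a_j, where a_1,...,a_k are the
    # k-principal points of sqrt((n+1)p/((n-1)p+2)) * t_{(n-1)p+2}(0, Psi).
    rng = np.random.default_rng(seed)
    p = Psi.shape[0]
    nu = (n - 1) * p + 2
    scale = np.sqrt((n + 1) * p / nu)
    a = principal_points(sample_scaled_mvt(rng, n_mc, nu, Psi, scale), k)
    return mu_hat + sigma_hat * a
```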

Numerical example

We assume that \(\varvec{X}\sim N\left( \varvec{\mu },\sigma ^{2} {2^2 \ 0\atopwithdelims ()\ 0 \ \ 1^2} \right) \) and consider the case in which 4-principal points of the distribution of \(\varvec{X}\) are estimated from a random sample of size n. The actual 4-principal points of \(\varvec{X}\) are expressed by

$$\begin{aligned} \varvec{\gamma }_1= & {} \varvec{\mu }+\sigma {2.5679 \atopwithdelims ()0}, \ \varvec{\gamma }_2=\varvec{\mu }+\sigma {0 \atopwithdelims ()0.8693}, \ \varvec{\gamma }_3=\varvec{\mu }+\sigma {0 \atopwithdelims ()-0.8693}, \\ \varvec{\gamma }_4= & {} \varvec{\mu }+\sigma {-2.5679 \atopwithdelims ()0}, \end{aligned}$$

and the MSD (1.1) value is \(1.4180\sigma ^{2}\).
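As a quick check, this MSD value can be reproduced up to Monte Carlo error with the msd helper sketched in Sect. 1, taking \(\varvec{\mu }=\varvec{0}\) and \(\sigma =1\):

```python
import numpy as np

rng = np.random.default_rng(0)
Psi = np.diag([4.0, 1.0])
sample = rng.multivariate_normal(np.zeros(2), Psi, size=500_000)
gammas = np.array([[ 2.5679, 0.0], [0.0,  0.8693],
                   [ 0.0, -0.8693], [-2.5679, 0.0]])
print(msd(gammas, sample))   # approximately 1.418 (= MSD with sigma = 1)
```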

Here, we suppose that \(\varvec{\mu }\) and \(\sigma \) are unknown and compare the maximum likelihood estimator and the minimum EMSD estimator of 4-principal points. The maximum likelihood estimator of 4-principal points is simply given by

$$\begin{aligned} \hat{\varvec{\gamma }}_1= & {} \hat{\varvec{\mu }}+\hat{\sigma } {2.5679 \atopwithdelims ()0}, \ \hat{\varvec{\gamma }}_2=\hat{\varvec{\mu }}+\hat{\sigma } {0 \atopwithdelims ()0.8693}, \ \hat{\varvec{\gamma }}_3=\hat{\varvec{\mu }}+\hat{\sigma } {0 \atopwithdelims ()-0.8693}, \\ \hat{\varvec{\gamma }}_4= & {} \hat{\varvec{\mu }}+\hat{\sigma } {-2.5679 \atopwithdelims ()0}, \end{aligned}$$

while the minimum EMSD estimator of 4-principal points is expressed by

$$\begin{aligned} \hat{\varvec{\gamma }}_1=\hat{\varvec{\mu }}+\hat{\sigma } \varvec{a}_{1}^{*}, \ \hat{\varvec{\gamma }}_2=\hat{\varvec{\mu }}+\hat{\sigma } \varvec{a}_{2}^{*}, \ \hat{\varvec{\gamma }}_3=\hat{\varvec{\mu }}+\hat{\sigma } \varvec{a}_{3}^{*}, \ \hat{\varvec{\gamma }}_4=\hat{\varvec{\mu }}+\hat{\sigma } \varvec{a}_{4}^{*}, \end{aligned}$$

where \(\varvec{a}_{1}^{*},\varvec{a}_{2}^{*},\varvec{a}_{3}^{*},\varvec{a}_{4}^{*}\) are 4-principal points of \(\sqrt{\frac{(n+1)p}{(n-1)p+2}}\varvec{t}_{(n-1)p+2}(\varvec{0},\varvec{\Psi })\) with \(p=2\) and \(\varvec{\Psi }={2^2 \ 0\atopwithdelims ()\ 0 \ \ 1^2}\).

Table 1 shows the results obtained by numerical computations for the maximum likelihood estimator and the minimum EMSD estimator of 4-principal points and their EMSD (2.3) values when the sample size is \(n=2,3,5,7,10\). We see from Table 1 that the minimum EMSD estimator gives much smaller EMSD values than the maximum likelihood estimator, especially for small sample size. It is interesting to note that, as Tarpey (1998) showed, the 4-principal points of \(N\left( \varvec{\mu },\sigma ^2 {\tau ^2 \ 0\atopwithdelims ()\ 0 \ \ 1^2} \right) \) form a “cross” pattern for \(1\le \tau \le 2.15\) (approximately) and form a line pattern (i.e. the 4 points lie on the first principal component axis) for \(\tau > 2.15\). Hence, since we set \(\tau =2\) in this example, the actual 4-principal points form a cross pattern and the maximum likelihood estimator also forms a cross pattern; however, the minimum EMSD estimator forms a line pattern when \(n=2,3,5\). It is also interesting to note that in the case where both the maximum likelihood estimator and the minimum EMSD estimator form cross patterns (\(n=7\) and 10 in Table 1), the minimum EMSD estimator points are spread out more than the maximum likelihood estimator points, i.e. in order to minimize the EMSD, the points need to move further away from the mean. It is also worth noting that nonparametric estimators of principal points obtained by the k-means algorithm (Hartigan and Wong 1979) are not available when \(k>n\) and they are extremely unstable when \(k\approx n\).

Table 1 Estimator \(\hat{\varvec{\gamma }}_j=\hat{\varvec{\mu }}+\hat{\sigma } \varvec{a}_j, j=1,2,3,4\) of 4-principal points and EMSD values for \(n=2,3,5,7,10\).
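Comparisons of this kind can be reproduced by combining the helpers sketched earlier. The script below is only an illustration, not the computation behind Table 1; the printed values approximate EMSD with \(\varvec{\mu }=\varvec{0}\) and \(\sigma =1\), and all names are ours.

```python
import numpy as np

Psi = np.diag([4.0, 1.0])
delta_star = np.array([[ 2.5679, 0.0], [0.0,  0.8693],
                       [ 0.0, -0.8693], [-2.5679, 0.0]])  # 4-pp of N(0, Psi)

def sampler(rng, m):
    return rng.multivariate_normal(np.zeros(2), Psi, size=m)

def plugin(a):
    # Estimator of the form gamma_hat_j = mu_hat + sigma_hat * a_j.
    def estimator(train, k):
        mu_hat, sigma_hat = mle_location_scale_normal(train, Psi)
        return mu_hat + sigma_hat * a
    return estimator

for n in (2, 3, 5, 7, 10):
    nu = (n - 1) * 2 + 2
    scale = np.sqrt((n + 1) * 2 / nu)
    rng = np.random.default_rng(n)
    # The a_j of the minimum EMSD estimator depend only on (n, k, Psi),
    # so they are computed once per n outside the EMSD simulation.
    a_star = principal_points(sample_scaled_mvt(rng, 200_000, nu, Psi, scale), 4)
    print(n,
          emsd_mc(plugin(delta_star), sampler, n, 4),   # maximum likelihood
          emsd_mc(plugin(a_star), sampler, n, 4))       # minimum EMSD
```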

3 Extension to multivariate location-scale-rotation families

In this section, we extend Theorem 1 to multivariate location-scale-rotation families.

The distribution of \(\varvec{X}\) belongs to a location-scale-rotation family if its pdf is of the form

$$\begin{aligned} f(\varvec{x};\varvec{\mu },\sigma ,\varvec{H}) = \frac{1}{\sigma ^p} g\left( \frac{\varvec{H}'(\varvec{x}-\varvec{\mu })}{\sigma } \right) \ \text{ for } \text{ some } \text{ pdf } \ g:\mathfrak {R}^p \rightarrow [0,\infty ), \end{aligned}$$
(3.1)

where \(\varvec{\mu }\in \mathfrak {R}^p\) is a location parameter, \(\sigma >0\) is a scale parameter, and \(\varvec{H}\in O_p\) is a rotation parameter.

Let \(\varvec{\delta }_{1}^{*},\dots ,\varvec{\delta }_{k}^{*}\) be k-principal points of the distribution with pdf \(g(\varvec{x})\). Then, k-principal points \(\varvec{\gamma }_{1}^{*},\dots ,\varvec{\gamma }_{k}^{*}\) of the distribution of \(\varvec{X}\) are given by

$$\begin{aligned} \varvec{\gamma }_{j}^{*}=\varvec{\mu } + \sigma \varvec{H} \varvec{\delta }_{j}^{*}, \ j=1,\dots ,k. \end{aligned}$$
(3.2)

Suppose that \(\varvec{\mu }\), \(\sigma \), and \(\varvec{H}\) are unknown and the function g is known. Let \(\varvec{X}_1,\dots ,\varvec{X}_n\) be a random sample of size n drawn from the distribution of \(\varvec{X}\), and let \(\hat{\varvec{\mu }}\), \(\hat{\sigma }\), and \(\hat{\varvec{H}}\) be the maximum likelihood estimators of \(\varvec{\mu }\), \(\sigma \), and \(\varvec{H}\) obtained from the sample \(\varvec{X}_1,\dots ,\varvec{X}_n\), respectively.

Since k-principal points of the distribution of \(\varvec{X}\) are expressed as (3.2), we consider a principal points estimator of the following form:

$$\begin{aligned} \hat{\varvec{\gamma }}_{j} = \hat{\varvec{\mu }}+\hat{\sigma }\hat{\varvec{H}}\varvec{a}_j, \ j=1,\dots ,k. \end{aligned}$$

Before deriving a minimum EMSD estimator of k-principal points, we present some preparatory results. We let \(\varvec{X}\) and \(\varvec{X}_1,\dots ,\varvec{X}_n\) be independent and identically distributed with pdf (3.1), and define \(\varvec{U}_{Rot}\equiv \frac{\hat{\varvec{H}}'(\varvec{X}-\hat{\varvec{\mu }})}{\sigma }\) and \(W \equiv \frac{\hat{\sigma }}{\sigma }\). Here, we show the following Lemma, which is a straightforward extension of Antle and Bain (1969).

Lemma 1

The joint distribution of \(\varvec{U}_{Rot}\equiv \frac{\hat{\varvec{H}}'(\varvec{X}-\hat{\varvec{\mu }})}{\sigma }\) and \(W \equiv \frac{\hat{\sigma }}{\sigma }\) does not depend on \(\varvec{\mu }\), \(\sigma \), and \(\varvec{H}\).

Proof

We let \(\varvec{Y}\equiv \frac{\varvec{H}'(\varvec{X}-\varvec{\mu })}{\sigma }\), \(\varvec{Y}_i \equiv \frac{\varvec{H}'(\varvec{X}_i-\varvec{\mu })}{\sigma }, \ i=1,\dots ,n\), \(\hat{\varvec{u}}\equiv \frac{\varvec{H}'(\hat{\varvec{\mu }}-\varvec{\mu })}{\sigma }\), \(\hat{w} \equiv \frac{\hat{\sigma }}{\sigma }\), and \(\hat{\varvec{V}}\equiv \varvec{H}'\hat{\varvec{H}}\). It is obvious that the distributions of \(\varvec{Y}\) and \(\varvec{Y}_i, \ i=1,\dots ,n\) (whose pdfs are all \(g(\varvec{y})\)) do not depend on \(\varvec{\mu }\), \(\sigma \), and \(\varvec{H}\). Since

$$\begin{aligned} \{ \hat{\varvec{\mu }},\hat{\sigma },\hat{\varvec{H}} \} = \arg \max _{\varvec{\mu }_0,\sigma _0,\varvec{H}_0}\prod _{i=1}^{n} \left\{ \frac{1}{\sigma _{0}^{p}} g\left( \frac{\varvec{H}_0'(\varvec{X}_i-\varvec{\mu }_0)}{\sigma _0} \right) \right\} , \end{aligned}$$

we see that

$$\begin{aligned} \{\hat{\varvec{u}},\hat{w},\hat{\varvec{V}} \}= & {} \arg \max _{\varvec{u},w,\varvec{V}}\prod _{i=1}^{n} \left\{ \frac{1}{w^p \sigma ^p} g\left( \frac{\varvec{V}'\varvec{H}' \{ (\varvec{\mu }+\sigma \varvec{H} \varvec{Y}_i)-(\varvec{\mu }+\sigma \varvec{H} \varvec{u})\} }{w \sigma } \right) \right\} \\= & {} \arg \max _{\varvec{u},w,\varvec{V}}\prod _{i=1}^{n} \left\{ \frac{1}{w^p} g\left( \frac{\varvec{V}' (\varvec{Y}_i-\varvec{u}) }{w} \right) \right\} . \end{aligned}$$

Hence, the joint distribution of (\(\hat{\varvec{u}},\hat{w},\hat{\varvec{V}}\)) does not depend on \(\varvec{\mu }\), \(\sigma \), and \(\varvec{H}\). Since \(\varvec{U}_{Rot}\equiv \frac{\hat{\varvec{H}}'(\varvec{X}-\hat{\varvec{\mu }})}{\sigma } =\hat{\varvec{V}}'(\varvec{Y}-\hat{\varvec{u}})\) and \(W \equiv \frac{\hat{\sigma }}{\sigma }=\hat{w}\), the joint distribution of (\(\varvec{U}_{Rot},W\)) does not depend on \(\varvec{\mu }\), \(\sigma \), and \(\varvec{H}\). \(\square \)

We also let \(\psi _{Rot}(\varvec{u}|W)\) be the pdf of \(\varvec{U}_{Rot}|W\). Then, a minimum EMSD estimator of principal points is given by the following theorem.

Theorem 3

Let

$$\begin{aligned} h_{Rot}(\varvec{z})=\frac{E[ W^{p+2} \psi _{Rot}(\varvec{z}W|W) ]}{E[W^{2}]}. \end{aligned}$$
(3.3)

Then, the values of \(\{ \varvec{a}_1,\dots ,\varvec{a}_k \}\) minimizing

$$\begin{aligned} \text{ EMSD } =E[MSD(\hat{\varvec{\gamma }}_{1},\dots ,\hat{\varvec{\gamma }}_{k})] =E[MSD(\hat{\varvec{\mu }} + \hat{\sigma }\hat{\varvec{H}}\varvec{a}_1,\dots ,\hat{\varvec{\mu }} + \hat{\sigma }\hat{\varvec{H}}\varvec{a}_k)]\quad \end{aligned}$$
(3.4)

are given by k-principal points of the distribution with pdf (3.3).

Proof

Noting that \(\Vert \varvec{x} \Vert = \Vert \varvec{G}'\varvec{x}\Vert \) holds for any orthogonal matrix \(\varvec{G}\), we see that

$$\begin{aligned}&E[MSD(\hat{\varvec{\mu }} + \hat{\sigma }\hat{\varvec{H}}\varvec{a}_1,\dots ,\hat{\varvec{\mu }} + \hat{\sigma }\hat{\varvec{H}}\varvec{a}_k)] \\&\quad = E\left[ \min _{j=1,\dots ,k} \Vert \varvec{X}-\hat{\varvec{\mu }} - \hat{\sigma }\hat{\varvec{H}}\varvec{a}_j \Vert ^2 \right] \\&\quad = E\left[ \min _{j=1,\dots ,k} \Vert \hat{\varvec{H}}'(\varvec{X}-\hat{\varvec{\mu }} - \hat{\sigma }\hat{\varvec{H}}\varvec{a}_j) \Vert ^2 \right] \\&\quad = \sigma ^{2} E\left[ \min _{j=1,\dots ,k} \left\| \frac{\hat{\varvec{H}}'(\varvec{X}-\hat{\varvec{\mu }})}{\sigma } - \frac{\hat{\sigma }}{\sigma }\varvec{a}_j \right\| ^2 \right] \\&\quad = \sigma ^{2} E\left[ \min _{j=1,\dots ,k} \Vert \varvec{U}_{Rot} - W\varvec{a}_j \Vert ^2 \right] . \end{aligned}$$

The remainder of the proof follows exactly as in the proof of Theorem 1 and is thus omitted. \(\square \)

We note that Theorem 3 treats a wider model (the location-scale-rotation family) than Theorem 1 and Proposition 2 (the location-scale family): the rotation parameter \(\varvec{H}\) is assumed to be known in Sect. 2, whereas \(\varvec{H}\) is unknown and is estimated in this section. This means that if \(\varvec{\Sigma }=\sigma ^2\varvec{\Psi }=\sigma ^2\varvec{H}\varvec{\Lambda }\varvec{H}'\) denotes a spectral decomposition of the covariance matrix \(\varvec{\Sigma }\), then Theorem 1 and Proposition 2 require \(\varvec{\Psi }\) to be known, whereas Theorem 3 requires only \(\varvec{\Lambda }\) to be known. Hence, the knowledge assumed for location-scale families, namely the entire covariance matrix up to a scalar factor, is reduced to knowledge of the eigenvalues \(\varvec{\Lambda }\) up to a scalar factor.

Numerical example

Here is a simple numerical illustration. Assume that \(\varvec{X}\sim N\left( \varvec{\mu },\sigma ^{2}\varvec{H} {\lambda ^2 \ 0\atopwithdelims ()\ 0 \ \ 1^2} \varvec{H}' \right) \) with \(\lambda > 1\) and \(\varvec{H}\in O_2\), and consider the case in which 2-principal points are estimated from a random sample of size \(n=2\). The actual 2-principal points are given by

$$\begin{aligned} \varvec{\gamma }_1=\varvec{\mu }+\sigma \varvec{H} {\sqrt{\frac{2}{\pi }}\lambda \atopwithdelims ()0}, \ \varvec{\gamma }_2=\varvec{\mu }+\sigma \varvec{H} {-\sqrt{\frac{2}{\pi }}\lambda \atopwithdelims ()0}, \end{aligned}$$

and the MSD (1.1) value is \(\{ (1-2/\pi )\lambda ^2 + 1\} \sigma ^{2}\).

Suppose that \(\varvec{\mu }\), \(\sigma \), and \(\varvec{H}\) are unknown and \(\lambda \) is known. Then, the maximum likelihood estimators of \(\varvec{\mu }\), \(\sigma \), and \(\varvec{H}\) obtained from the random sample \(\varvec{X}_1,\varvec{X}_2\) of size \(n=2\) are given as follows:

$$\begin{aligned} \hat{\varvec{\mu }}= & {} \frac{\varvec{X}_1+\varvec{X}_2}{2},\\ \hat{\sigma }= & {} \frac{\Vert \varvec{X}_1-\varvec{X}_2\Vert }{2\sqrt{2}\lambda },\\ \hat{\varvec{H}}= & {} \left( \begin{array}{cc} \cos \hat{\theta } &{} -\sin \hat{\theta }\\ \sin \hat{\theta } &{} \cos \hat{\theta } \end{array} \right) , \end{aligned}$$

where

$$\begin{aligned} {\cos \hat{\theta } \atopwithdelims ()\sin \hat{\theta }} = \frac{\varvec{X}_1-\varvec{X}_2}{\Vert \varvec{X}_1-\varvec{X}_2 \Vert }. \end{aligned}$$
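For completeness, these closed-form estimators for \(n=2\) translate directly into code; the sketch below (function name ours) returns \(\hat{\varvec{\mu }}\), \(\hat{\sigma }\), and \(\hat{\varvec{H}}\) for two observations in \(\mathfrak {R}^2\).

```python
import numpy as np

def mle_lsr_n2(X1, X2, lam):
    # MLEs of mu, sigma, and H for a sample of size n = 2 from
    # N(mu, sigma^2 H diag(lam^2, 1) H'), following the formulas above.
    mu_hat = (X1 + X2) / 2.0
    d = X1 - X2
    sigma_hat = np.linalg.norm(d) / (2.0 * np.sqrt(2.0) * lam)
    c, s = d / np.linalg.norm(d)          # cos(theta_hat), sin(theta_hat)
    H_hat = np.array([[c, -s],
                      [s,  c]])
    return mu_hat, sigma_hat, H_hat
```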

Here, we compare the following three estimators of 2-principal points. One is to use the maximum likelihood estimator of 2-principal points:

$$\begin{aligned} \hat{\varvec{\gamma }}_1=\hat{\varvec{\mu }}+\hat{\sigma } \hat{\varvec{H}} {\sqrt{\frac{2}{\pi }}\lambda \atopwithdelims ()0}, \ \hat{\varvec{\gamma }}_2=\hat{\varvec{\mu }}+\hat{\sigma } \hat{\varvec{H}} {-\sqrt{\frac{2}{\pi }}\lambda \atopwithdelims ()0}. \end{aligned}$$

This is simple but not optimal. Another estimator is to use the location-scale-family-based minimum EMSD estimator presented in Sect. 2 by substituting \(\hat{\varvec{H}} {\lambda ^2 \ 0\atopwithdelims ()\ 0 \ \ 1^2} \hat{\varvec{H}}'\) for \(\varvec{\Psi }\), that is, to use

$$\begin{aligned} \hat{\varvec{\gamma }}_1=\hat{\varvec{\mu }}+\hat{\sigma }\varvec{b}_{1}^{\dagger }, \ \hat{\varvec{\gamma }}_2=\hat{\varvec{\mu }}+\hat{\sigma }\varvec{b}_{2}^{\dagger }, \end{aligned}$$
(3.5)

where \(\varvec{b}_{1}^{\dagger },\varvec{b}_{2}^{\dagger }\) are 2-principal points of \(\sqrt{\frac{(n+1)p}{(n-1)p+2}}\varvec{t}_{(n-1)p+2}(\varvec{0},\varvec{\tilde{\Psi }})\) with \(n=2\), \(p=2\), and \(\varvec{\tilde{\Psi }}=\hat{\varvec{H}} {\lambda ^2 \ 0\atopwithdelims ()\ 0 \ \ 1^2} \hat{\varvec{H}}'\). If \(\varvec{H}\) were known and \(\varvec{\tilde{\Psi }}=\varvec{H} {\lambda ^2 \ 0\atopwithdelims ()\ 0 \ \ 1^2} \varvec{H}'\) were used, then this approach would be optimal. However, since \(\varvec{H}\) is unknown and \(\hat{\varvec{H}}\) is used, this approach is suboptimal. We note that (3.5) is equivalent to

$$\begin{aligned} \hat{\varvec{\gamma }}_1=\hat{\varvec{\mu }}+\hat{\sigma } \hat{\varvec{H}} \varvec{a}_{1}^{\dagger }, \ \hat{\varvec{\gamma }}_2=\hat{\varvec{\mu }}+\hat{\sigma } \hat{\varvec{H}} \varvec{a}_{2}^{\dagger }, \end{aligned}$$

where \(\varvec{a}_{1}^{\dagger },\varvec{a}_{2}^{\dagger }\) are 2-principal points of \(\sqrt{\frac{(n+1)p}{(n-1)p+2}}\varvec{t}_{(n-1)p+2}\left( \varvec{0},{\lambda ^2 \ 0\atopwithdelims ()\ 0 \ \ 1^2}\right) \) with \(n=2\) and \(p=2\). The third option is to use the location-scale-rotation-family-based minimum EMSD estimator presented in this section:

$$\begin{aligned} \hat{\varvec{\gamma }}_1=\hat{\varvec{\mu }}+\hat{\sigma } \hat{\varvec{H}} \varvec{a}_{1}^{*}, \ \hat{\varvec{\gamma }}_2=\hat{\varvec{\mu }}+\hat{\sigma } \hat{\varvec{H}} \varvec{a}_{2}^{*}, \end{aligned}$$

where \(\varvec{a}_{1}^{*},\varvec{a}_{2}^{*}\) are 2-principal points of the distribution with pdf (3.3), which is the optimal procedure in this case.
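The points \(\varvec{a}_{1}^{*},\varvec{a}_{2}^{*}\) can be approximated by the same simulation device sketched after Theorem 1, with \(\varvec{Z}=(\varvec{X}-\hat{\varvec{\mu }})/\hat{\sigma }\) replaced by \(\varvec{Z}=\hat{\varvec{H}}'(\varvec{X}-\hat{\varvec{\mu }})/\hat{\sigma }\). The sketch below assumes mle_lsr(train) returns (\(\hat{\varvec{\mu }}, \hat{\sigma }, \hat{\varvec{H}}\)), for instance a small wrapper around the mle_lsr_n2 helper above when \(n=2\); all names are ours.

```python
import numpy as np

def min_emsd_a_points_rot(g_sampler, mle_lsr, n, k, p, n_mc=50_000, seed=0):
    # Approximate the optimal a_1,...,a_k of Theorem 3, i.e. k-principal
    # points of the distribution with pdf (3.3), by simulation with
    # mu = 0, sigma = 1, H = I, weighting Z = H_hat'(X - mu_hat)/sigma_hat
    # by W^2 = sigma_hat^2.
    rng = np.random.default_rng(seed)
    Z = np.empty((n_mc, p))
    w2 = np.empty(n_mc)
    for r in range(n_mc):
        x = g_sampler(rng, 1)[0]
        mu_hat, sigma_hat, H_hat = mle_lsr(g_sampler(rng, n))
        Z[r] = H_hat.T @ (x - mu_hat) / sigma_hat
        w2[r] = sigma_hat ** 2
    idx = rng.choice(n_mc, size=n_mc, p=w2 / w2.sum())
    return principal_points(Z[idx], k)
```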

Table 2 shows the results obtained by numerical computations for the maximum likelihood estimator (denoted by MLE), the location-scale-family-based minimum EMSD estimator (denoted by LSF-MEMSDE), and the location-scale-rotation-family-based minimum EMSD estimator (denoted by LSRF-MEMSDE) for \(\lambda =1.5,2,2.5,3\). The EMSD (3.4) values of the three estimators are also shown. Table 2 indicates that the LSRF-MEMSDE gives much smaller EMSD values than the MLE. The difference between the LSF-MEMSDE and the LSRF-MEMSDE is relatively small, but it may not be negligible, especially when \(\lambda \) is large. In this example, the matrix of eigenvectors \(\varvec{H}\) of the covariance matrix plays the role of the rotation parameter, but we note that Theorem 3 applies to any rotation matrix in the location-scale-rotation family.

Table 2 Estimator \(\hat{\varvec{\gamma }}_j=\hat{\varvec{\mu }}+\hat{\sigma } \hat{\varvec{H}}\varvec{a}_j, \ j=1,2\) of 2-principal points and EMSD values for \(\lambda =1.5,2,2.5,3\)

4 Conclusion and remark

Cluster analysis applications occur mostly in multivariate settings. Efficiency in estimating cluster means can be improved by incorporating parametric information, namely, using information on the principal points of the underlying parametric distribution.

In this paper, we have discussed the parametric estimation of principal points for a wide class of multivariate distributions. More specifically, we have derived optimal estimators of principal points minimizing the expected mean squared distance (EMSD) for location-scale and location-scale-rotation families. Numerical results have suggested that the minimum EMSD estimators given in this paper enable substantial improvements over conventional maximum likelihood estimators.

A possible extension of our results would be to the case where the covariance matrix is fully unknown. Different approaches may be needed to study the optimality of principal points estimators in this important and interesting case.

The final remark of this paper is that the proofs of Theorems 1 and 3 depend on the maximum likelihood estimators only through their equivariance properties, which hold in general under a very mild condition (Eaton 1983, Section 7.4), with respect to the transformation \(\varvec{X}_i \rightarrow a\varvec{X}_i + \varvec{b}\) (in Theorem 1) or \(\varvec{X}_i \rightarrow a\varvec{C}\varvec{X}_i + \varvec{b}\) (in Theorem 3) with \(a > 0\), \(\varvec{b} \in \mathfrak {R}^p\), and \(\varvec{C}\in O_p\) (\(i = 1,2,\dots ,n\)): for example, \(\hat{\varvec{\mu }} \rightarrow a\varvec{C}\hat{\varvec{\mu }}+\varvec{b}\), \(\hat{\sigma } \rightarrow a\hat{\sigma }\), and \(\hat{\varvec{H}} \rightarrow \varvec{C}\hat{\varvec{H}}\) when \(\varvec{X}_i \rightarrow a\varvec{C}\varvec{X}_i + \varvec{b}\). Hence, our results remain valid for the broader class of equivariant estimators of \(\varvec{\mu }\), \(\sigma \) (and \(\varvec{H}\)).