1 Introduction

Let \(\mathbf{X}_k = (X_{k1}, \ldots , X_{kp})' \sim {\mathcal {F}}\), \(k = 1, \ldots , n\) be iid random vectors, where \({\mathcal {F}}\) denotes a p-variate distribution, with \(E(\mathbf{X}_k) = {\varvec{\mu }} \in {\mathbb {R}}^p\) and \({{\mathrm{Cov}}}(\mathbf{X}_k) = {\varvec{\Sigma }} \in {\mathbb {R}}^{p\times p}_{> 0}\). A hypothesis of foremost interest to be tested in this setup is \(H_0: {\varvec{\mu }} = \mathbf{0}\) against an appropriate alternative, say \(H_1:\) Not \(H_0\). For an extension to \(g \ge 2\) samples, let \(\mathbf{X}_{ik} = (X_{ik1}, \ldots , X_{ikp})' \sim {\mathcal {F}}_i\) be iid random vectors with \(E(\mathbf{X}_{ik}) = {\varvec{\mu }}_i \in {\mathbb {R}}^p\), \({{\mathrm{Cov}}}(\mathbf{X}_{ik}) = {\varvec{\Sigma }}_i \in {\mathbb {R}}^{p\times p}_{>0}\), \(k = 1, \ldots , n_i\), \(i = 1, \ldots , g\). The corresponding hypothesis of interest is \(H_{0g}:{\varvec{\mu }}_1 = \cdots = {\varvec{\mu }}_g\) vs. \(H_{1g}:\) Not \(H_{0g}\).

Our objective here is to present test statistics for the aforementioned one- and multi-sample hypotheses when \(p > n_i\), the \({\mathcal {F}}_i\) are not necessarily normal, and the \({\varvec{\Sigma }}_i\), likewise the \(n_i\), may be unequal in the multi-sample case. The proposed tests are thus valid for high-dimensional, non-normal, unbalanced data under the Behrens–Fisher problem. In particular, for \(g \ge 3\), this amounts to testing the high-dimensional one-way MANOVA hypothesis under non-normality and the multi-sample Behrens–Fisher problem.

When \(p < n_i\), tests of \(H_0\) or \(H_{0g}\) are most often carried out with Hotelling’s \(T^2\) or Wilks’ Lambda statistic, which are uniformly most powerful invariant likelihood ratio tests. They, however, collapse in the high-dimensional case, particularly due to the singularity of the empirical covariance matrix involved (see Sects. 2, 3). A number of proposals have recently been put forth in the literature on modifying these classical tests for high-dimensional data.

Whereas most modifications assume normality, some are based on more flexible models, and still others offer a completely nonparametric solution to the problem. The same holds for the homoscedasticity assumption, \({\varvec{\Sigma }}_i = {\varvec{\Sigma }}\) \(\forall ~i = 1, \ldots , g \ge 2\). For details, see e.g., Dempster (1958), Bai and Saranadasa (1996), Läuter et al. (1998), Läuter (2004), Fujikoshi (2004), Schott (2007), Chen and Qin (2010), Aoshima and Yata (2011, 2015), Katayama and Kano (2014), Wang et al. (2015), Feng et al. (2016) and Hu et al. (2017). For reviews, see Hu and Bai (2015) and Fujikoshi et al. (2010).

We present a coherent testing theory encompassing one- and multi-sample cases. The construction of the tests, the assumptions, and the strategy for obtaining the limiting distributions of the test statistics are succinctly threaded together via a common approach, starting with the one-sample case and extending systematically to the multi-sample cases. The main distinguishing feature of the proposed tests is that we simultaneously relax commonly adopted linear model assumptions such as normality and homoscedasticity, for all cases up to one-way MANOVA. Further, all tests are defined in terms of U-statistics with simple, bivariate, product kernels composed of bilinear forms of independent vectors. This helps us determine the limits of the test statistics for a general multivariate model. These limits are derived under \((n_i, p)\)- or high-dimensional asymptotics, i.e., \(n_i, p \rightarrow \infty \), using only a few mild assumptions.

The basic idea is introduced in detail for the one-sample case in the next section, with an extension to the two-sample case. The multi-sample extension follows in Sect. 3. Sections 4 and 5 deal with simulations and applications. Proofs and technical details are deferred to “Appendix”.

2 The one- and two-sample tests

2.1 The one-sample case

For the one-sample data setup in Sect. 1, let the unbiased estimators of \({\varvec{\mu }}\) and \({\varvec{\Sigma }}\) be defined as \(\overline{\mathbf{X}} = \sum _{k = 1}^n\mathbf{X}_k/n\) and \(\widehat{\varvec{\Sigma }} = \sum _{k = 1}^n(\mathbf{X}_k - \overline{\mathbf{X}})(\mathbf{X}_k - \overline{\mathbf{X}})'/(n - 1)\). If \(n > p\) and \({\mathcal {F}}\) is multivariate normal, then \(H_0: {\varvec{\mu }} = \mathbf{0}\) can be tested using Hotelling’s statistic \(T^2 = n\overline{\mathbf{X}}'\widehat{\varvec{\Sigma }}^{-1}\overline{\mathbf{X}} = \overline{\mathbf{X}}'[\widehat{\varvec{\Sigma }}/n]^{-1}\overline{\mathbf{X}}\), where \(\widehat{\varvec{\Sigma }}/n\) estimates \({\varvec{\Sigma }}/n = {{\mathrm{Cov}}}(\overline{\mathbf{X}})\). When \(p > n\), \(\widehat{\varvec{\Sigma }}\) is singular and \(T^2\) collapses, requiring a careful modification that can provide valid inference when \(p \rightarrow \infty \), possibly along with \(n \rightarrow \infty \). The two indices may sometimes be assumed to grow at the same rate, so that \(p/n \rightarrow c \in (0, \infty )\). Alternatively, sequential asymptotics, first letting \(p \rightarrow \infty \) followed by \(n \rightarrow \infty \), may be considered, under which conditions like \(p/n \rightarrow c\) may be dispensed with.
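To make the breakdown concrete, the following minimal NumPy sketch (our illustration, not part of the original development) computes \(T^2\) in the classical regime and shows that \(\widehat{\varvec{\Sigma }}\) loses rank, hence invertibility, once \(p > n\):

```python
import numpy as np

rng = np.random.default_rng(0)

def hotelling_T2(X):
    """Classical one-sample Hotelling T^2; requires n > p."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)              # unbiased estimator of Sigma
    return n * xbar @ np.linalg.solve(S, xbar)

X_low = rng.standard_normal((50, 5))         # n = 50 > p = 5: T^2 is well defined
print(hotelling_T2(X_low))

X_high = rng.standard_normal((20, 100))      # n = 20 < p = 100
S = np.cov(X_high, rowvar=False)
print(np.linalg.matrix_rank(S))              # rank <= n - 1 = 19 < p: singular
```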

Note that \(\widehat{\varvec{\Sigma }}\) may be ill-conditioned for \(p < n\), whence \((\cdot )^{-1}\) can be replaced with the Moore–Penrose inverse (see e.g., Duchesne and Francq 2015). For \(p \gg n\), this approach is unreliable and inefficient. An alternative is to remove \(\widehat{\varvec{\Sigma }}^{-1}\) from \(T^2\) and consider the Euclidean distance \(Q = \overline{\mathbf{X}}'\overline{\mathbf{X}} = \Vert \overline{\mathbf{X}}\Vert ^2\). An interesting consequence of this can be witnessed by a simple split of Q as

$$\begin{aligned} Q = \frac{1}{n^2}\sum _{k = 1}^n\sum _{r = 1}^n\mathbf{X}'_k\mathbf{X}_r ~=~ \frac{1}{n^2}\sum _{k = 1}^n\mathbf{X}'_k\mathbf{X}_k + \frac{1}{n^2}\underset{k \ne r}{\sum _{k = 1}^n\sum _{r = 1}^n}\mathbf{X}'_k\mathbf{X}_r = Q_1 + U_n, \end{aligned}$$
(1)

where \(Q_1 = (E - U_n)/n\) with \(E = \sum _{k = 1}^n\mathbf{X}'_k\mathbf{X}_k/n\) and \(U_n = \sum _{k \ne r}^n\mathbf{X}'_k\mathbf{X}_r/n(n - 1)\). Note that E is an average of quadratic forms, and \(U_n\) is an average of bilinear forms composed of independent components. It is shown below that the limiting distribution of the statistic mainly follows from \(U_n\), whereas \(Q_1\) converges in probability to a constant. With \(E(\mathbf{X}'_k\mathbf{X}_k) = {{\mathrm{tr}}}({\varvec{\Sigma }}) + {\varvec{\mu }}'{\varvec{\mu }}\), \(E(\mathbf{X}'_k\mathbf{X}_r) = {\varvec{\mu }}'{\varvec{\mu }}\), we get \(E(Q_1) = {{\mathrm{tr}}}({\varvec{\Sigma }})/n\), \(E(U_n) = {\varvec{\mu }}'{\varvec{\mu }}\). Thus,

$$\begin{aligned} E(Q) = {{\mathrm{tr}}}({\varvec{\Sigma }})/n + \Vert {\varvec{\mu }}\Vert ^2, \end{aligned}$$
(2)

which is \({{\mathrm{tr}}}({\varvec{\Sigma }})/n\) under \(H_0\). We observe a few salient features of this bifurcation of Q. First, \(E(Q_1) = {{\mathrm{tr}}}({\varvec{\Sigma }})/n = {{\mathrm{tr}}}({{\mathrm{Cov}}}(\overline{\mathbf{X}}))\) implies that the removal of the inverse of the estimator of \({{\mathrm{Cov}}}(\overline{\mathbf{X}})\) results in a bias term composed of the trace of the same estimator, since it can be verified that \(Q_1 = {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})/n\) or \(\mathbf{E}_1 - \mathbf{Q}_0 = \widehat{\varvec{\Sigma }}\) with \(\mathbf{E}_1 = \sum _{k = 1}^n\mathbf{X}_k\mathbf{X}'_k/n\) and \(\mathbf{Q}_0 = \sum _{k \ne r}^n\mathbf{X}_k\mathbf{X}'_r/n(n - 1)\) as matrix versions of E and \(U_n\). Note also that \(Q_1\) does not depend on \({\varvec{\mu }}\), and \(E(U_n)\) does not depend on \({\varvec{\Sigma }}\), under both \(H_0\) and \(H_1\). Now, \(E(U_n) = \Vert {\varvec{\mu }}\Vert ^2\), which is 0 under \(H_0\). Together, the last two facts imply that \(U_n\) can be used to construct the modified test statistic for \(H_0\), whereas \(Q_1\) can help compensate for the removal of the estimator from the original test statistic. For this, write \(Q = {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})/n + U_n = Q_1 + U_n\) and, by a simple scaling and re-writing, consider the statistic

$$\begin{aligned} T_1 = 1 + \frac{nQ_0}{nQ_1/p}, \end{aligned}$$
(3)

where \(Q_0 = U_n/p\) is \(U_n\) but with its kernel normed by p, \(h(\mathbf{x}_k, \mathbf{x}_r) = \mathbf{x}'_k\mathbf{x}_r/p\). \(T_1\) is the proposed modified statistic for \(H_0: {\varvec{\mu }} = \mathbf{0}\) when \(p \gg n\) and \({\mathcal {F}}\) may be non-normal.
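As a computational illustration (a sketch under our reading of Eqs. (1) and (3), not the authors' code), \(T_1\) can be assembled directly from the Gram matrix of the observations; the assertion checks the split \(Q = Q_1 + U_n\) numerically:

```python
import numpy as np

def T1_statistic(X):
    """One-sample statistic T1 = 1 + nQ0/(nQ1/p) of Eq. (3)."""
    n, p = X.shape
    G = X @ X.T                                      # bilinear forms X_k' X_r
    U_n = (G.sum() - np.trace(G)) / (n * (n - 1))    # U-statistic over k != r
    Q1 = np.trace(np.cov(X, rowvar=False)) / n       # Q1 = tr(Sigma-hat)/n
    Q0 = U_n / p                                     # kernel normed by p
    xbar = X.mean(axis=0)
    assert np.isclose(xbar @ xbar, Q1 + U_n)         # the split (1): Q = Q1 + U_n
    return 1 + (n * Q0) / (n * Q1 / p)
```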

For the limit of \(T_1\), \(nQ_1/p\) is first shown to converge in probability to a constant as \(n, p \rightarrow \infty \). Then, \(nQ_0\) is shown to converge weakly to a normal limit. Under \(H_0\), the kernel of \(U_n\) degenerates, so that the null limit follows through a weighted sum of independent \(\chi ^2\) variables. The limit of \(T_1\) then follows by Slutsky’s lemma. As the same scheme will later be extended to \(g \ge 2\), we treat the one-sample case in detail. Let \(\lambda _s\), \(s = 1, \ldots , p\), be the eigenvalues of \({\varvec{\Sigma }}\), so that \(\nu _s = \lambda _s/p\) are the eigenvalues of \({\varvec{\Sigma }}/p\). We need the following assumptions.

Assumption 1

\(E(X^4_{ks}) = \gamma _s \le \gamma < \infty ~\forall ~s = 1, \ldots , p\), \(\gamma \in {\mathbb {R}}^+\).

Assumption 2

\(\lim _{p \rightarrow \infty }\sum _{s = 1}^p\nu _s = \nu _0 \in {\mathbb {R}}^+\).

Assumption 3

\(\lim _{n, p \rightarrow \infty } p/n = c = O(1)\).

Assumption 4

\(\lim _{p \rightarrow \infty } {\varvec{\mu }}'{\varvec{\Sigma }}{\varvec{\mu }}/p = \phi = O(1)\).

Assumption 1 helps us relax normality. By Assumption 2, \(\sum _{s = 1}^p\nu ^2_s = O(1)\) as \(p \rightarrow \infty \). Assumption 4 is only required under \(H_1\). We have the following theorem, proved in “Appendix B.1”.

Theorem 5

For \(T_1\) in Eq. (3), \((T_1 - E(T_1))/\sqrt{{{\mathrm{Var}}}(T_1)}\xrightarrow {{\mathcal {D}}} N(0, 1)\) under Assumptions 1–4, as \(n, p \rightarrow \infty \), where \(E(T_1)\) and \({{\mathrm{Var}}}(T_1)\) denote the mean and variance of \(T_1\).

From the proof of Theorem 5, \(E(T_1)\) and \({{\mathrm{Var}}}(T_1)\) approximate 1 and \(2{{\mathrm{tr}}}({\varvec{\Xi }}^2)/[{{\mathrm{tr}}}({\varvec{\Xi }})]^2\), respectively, with \({\varvec{\Xi }} = n{\varvec{\Sigma }}/p\). As the limit follows from a weighted sum of \(\chi ^2_1\) variables, the moments in fact approximate those of a scaled Chi-square variable, say \(\chi ^2_f/f\) with moments 1 and 2/f, where \(f = f_1/f_2\), \(f_1 = [{{\mathrm{tr}}}({\varvec{\Xi }})]^2\), \(f_2 = {{\mathrm{tr}}}({\varvec{\Xi }}^2)\). Thus, to estimate \({{\mathrm{Var}}}(T_1)\), we need consistent estimators of \({{\mathrm{tr}}}({\varvec{\Sigma }}^2)\) and \([{{\mathrm{tr}}}({\varvec{\Sigma }})]^2\). Define \(Q = \sum _{k = 1}^n(\widetilde{\mathbf{X}}'_k\widetilde{\mathbf{X}}_k)^2/(n - 1)\), \(\widetilde{\mathbf{X}}_k = \mathbf{X}_k - \overline{\mathbf{X}}\), \(\eta = (n - 1)/[n(n - 2)(n - 3)]\). Then, \(E_2 = \eta \{(n - 1)(n - 2)\text {tr}(\widehat{\varvec{\Sigma }}^2) + [\text {tr}(\widehat{\varvec{\Sigma }})]^2 - nQ\}\) and \(E_3 = \eta \{2\text {tr}(\widehat{\varvec{\Sigma }}^2) + (n^2 - 3n + 1)[\text {tr}(\widehat{\varvec{\Sigma }})]^2 - nQ\}\) are unbiased and consistent estimators of \({{\mathrm{tr}}}({\varvec{\Sigma }}^2)\) and \([{{\mathrm{tr}}}({\varvec{\Sigma }})]^2\), respectively. Then \({\widehat{f}} = {\widehat{f}}_1/{\widehat{f}}_2\) is a consistent estimator of f, hence \({\widehat{{{\mathrm{Var}}}}}(T_1)\) of \({{\mathrm{Var}}}(T_1)\) such that \({\widehat{{{\mathrm{Var}}}}}(T_1)/{{\mathrm{Var}}}(T_1) \rightarrow 1\); see Ahmad (2017b) and the end of Sect. 3.
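A sketch of these estimators (our transcription of \(E_2\), \(E_3\) and the resulting \({\widehat{{{\mathrm{Var}}}}}(T_1) = 2{\widehat{f}}_2/{\widehat{f}}_1\); the \(n/p\) factors in \({\varvec{\Xi }}\) cancel in the ratio) may look as follows, computed from the \(n\times n\) Gram matrix so that no \(p\times p\) matrix need be formed:

```python
import numpy as np

def trace_estimators(X):
    """E2 (unbiased for tr(Sigma^2)) and E3 (unbiased for [tr(Sigma)]^2)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                  # X-tilde_k = X_k - xbar
    G = Xc @ Xc.T                            # n x n Gram matrix, cheap when p >> n
    tr_S = np.trace(G) / (n - 1)             # tr(Sigma-hat)
    tr_S2 = np.sum(G * G) / (n - 1) ** 2     # tr(Sigma-hat^2)
    Q = np.sum(np.diag(G) ** 2) / (n - 1)    # Q = sum_k (X-tilde_k' X-tilde_k)^2/(n-1)
    eta = (n - 1) / (n * (n - 2) * (n - 3))
    E2 = eta * ((n - 1) * (n - 2) * tr_S2 + tr_S ** 2 - n * Q)
    E3 = eta * (2 * tr_S2 + (n ** 2 - 3 * n + 1) * tr_S ** 2 - n * Q)
    return E2, E3

def var_T1_hat(X):
    """Plug-in estimator of Var(T1) ~ 2/f, f = [tr(Sigma)]^2 / tr(Sigma^2)."""
    E2, E3 = trace_estimators(X)
    return 2 * E2 / E3
```

We have the following corollary.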

Corollary 6

Theorem 5 remains valid when \({{\mathrm{Var}}}(T_1)\) is replaced with \({\widehat{{{\mathrm{Var}}}}}(T_1)\).

Power of \(T_1\) Let \(z_{\alpha }\) be the \(100\alpha \%\hbox {th}\) quantile of N(0, 1) and \(\beta ({\varvec{\theta }})\) the power function of \(T_1\) with \({\varvec{\theta }} \in {\varvec{\Theta }}_0\) or \({\varvec{\theta }} \in {\varvec{\Theta }}_1\), where \({\varvec{\Theta }}_0 = \{\mathbf{0}\}\), \({\varvec{\Theta }}_1 = {\varvec{\Theta }}\backslash \{\mathbf{0}\}\) are the respective parameter spaces under \(H_0\), \(H_1\) with \({\varvec{\Theta }} = {\varvec{\Theta }}_0\cup {\varvec{\Theta }}_1\), \({\varvec{\Theta }}_0\cap {\varvec{\Theta }}_1 = \emptyset \). By Theorem 5, \(\beta ({\varvec{\theta }}) = \text {P}(z_1 \ge z_{\alpha })\) with \(\beta ({\varvec{\theta }}|H_0) = \alpha \), \(\beta ({\varvec{\theta }}|H_1) = 1 - \beta \), as \(n, p \rightarrow \infty \), where \(z_1 = (T_1 - E(T_1))/\sqrt{{{\mathrm{Var}}}(T_1)}\). Then, \(1 - \beta = P(z_1 \ge z_\alpha - n\delta )\), \(\delta = \delta _1/\delta _2\), \(\delta _1 = {\varvec{\mu }}'{\varvec{\mu }}/p\), \(\delta ^2_2 = \sum _{s = 1}^p\nu ^2_s\). By the convergence of \(nQ_1/p\), and as \(\delta _1\), \(\delta _2\) are uniformly bounded under the assumptions, \(1 - \beta \rightarrow 1\) as \(n, p \rightarrow \infty \).

Remark 7

A remark on the structure of \(T_1\) is in order. With \([nQ_1/p]/{{\mathrm{tr}}}({\varvec{\Xi }})\) converging in probability to 1, consider \(T_1 = 1 + nU_n/{{\mathrm{tr}}}({\varvec{\Sigma }})\), also ignoring p for convenience. Then, \(E(T_1) = 1 + n\Vert {\varvec{\mu }}\Vert ^2/{{\mathrm{tr}}}({\varvec{\Sigma }}) = 1 + E(\overline{\mathbf{X}})'E(\overline{\mathbf{X}})/{{\mathrm{tr}}}({{\mathrm{Cov}}}(\overline{\mathbf{X}}))\), where \(E(T_1) = 1\) under \(H_0\). In this sense, \(T_1\) is similar to an F-statistic: \(T_1\) is close to 1 under \(H_0\) and moves away as \({\varvec{\mu }}\) deviates from \(\mathbf{0}\). Since \({{\mathrm{tr}}}({{\mathrm{Cov}}}(\overline{\mathbf{X}})) + E(\overline{\mathbf{X}})'E(\overline{\mathbf{X}}) = E(\overline{\mathbf{X}}'\overline{\mathbf{X}})\), the partitioning used to define \(T_1\) not only helps adjust for the bias term but also makes the resulting statistic computationally much simpler, particularly under non-normality. A similar argument holds for the multi-sample tests presented in the next sections.

2.2 The two-sample case

For the multi-sample setup in Sect. 1, let \(g = 2\). We are interested in testing \(H_{02}: {\varvec{\mu }}_1 = {\varvec{\mu }}_2\) versus \(H_{12}:\) Not \(H_{02}\). Let \(\overline{\mathbf{X}}_i = \sum _{k = 1}^{n_i}\mathbf{X}_{ik}/n_i\) and \(\widehat{\varvec{\Sigma }}_i = \sum _{k = 1}^{n_i}(\mathbf{X}_{ik} - \overline{\mathbf{X}}_i)(\mathbf{X}_{ik} - \overline{\mathbf{X}}_i)'/(n_i - 1)\) be the unbiased estimators of \({\varvec{\mu }}_i\) and \({\varvec{\Sigma }}_i\). Denote \(n = n_1 + n_2\). Assuming normality, \({\varvec{\Sigma }}_i = {\varvec{\Sigma }}\) \(\forall ~i\) and \(n - 2 > p\), \(H_{02}\) is usually tested by the two-sample statistic \(T^2 = [n_1n_2/n](\overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2)'\widehat{\varvec{\Sigma }}^{-1}(\overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2)\), where \(\widehat{\varvec{\Sigma }} = \sum _{i = 1}^2(n_i - 1)\widehat{\varvec{\Sigma }}_i/\sum _{i = 1}^2(n_i - 1)\) is the pooled estimator of the common \({\varvec{\Sigma }}\). For \(p > n - 2\), or more generally for \(p > n_i\), \(T^2\) is invalid by the same token as its one-sample counterpart. We consider an analogous partition of \(Q = \Vert \overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2\Vert ^2 = \overline{\mathbf{X}}'_{1}\overline{\mathbf{X}}_{1} + \overline{\mathbf{X}}'_{2}\overline{\mathbf{X}}_{2} - 2\overline{\mathbf{X}}'_{1}\overline{\mathbf{X}}_{2}\) as

$$\begin{aligned} Q = \sum _{i = 1}^2\frac{1}{n^2_i}\sum _{k = 1}^{n_i}\sum _{r = 1}^{n_i}\mathbf{X}'_{ik}\mathbf{X}_{ir} - \frac{2}{n_1n_2}\sum _{k = 1}^{n_1}\sum _{l = 1}^{n_2}\mathbf{X}'_{1k}\mathbf{X}_{2l} ~=~ Q_1 + U_0 \end{aligned}$$
(4)

with \(Q_1 = \sum _{i = 1}^2Q_{i1} = \sum _{i = 1}^2{{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_i)/n_i = {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_0)\), \(\widehat{\varvec{\Sigma }}_0 = \sum _{i = 1}^2\widehat{\varvec{\Sigma }}_i/n_i\), \(Q_{i1} = (E_i - U_{n_i})/n_i = {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_i)/n_i\), \(U_0 = \sum _{i = 1}^2U_{n_i} - 2U_{n_1n_2}\), where \(E_i = \sum _{k = 1}^{n_i}\mathbf{X}'_{ik}\mathbf{X}_{ik}/n_i\) and

$$\begin{aligned} U_{n_i} = \frac{1}{n_i(n_i - 1)}\underset{k \ne r}{\sum _{k = 1}^{n_i}\sum _{r = 1}^{n_i}}\mathbf{X}'_{ik}\mathbf{X}_{ir}, \quad U_{n_1n_2} = \frac{1}{n_1n_2}\sum _{k = 1}^{n_1}\sum _{l = 1}^{n_2}\mathbf{X}'_{1k}\mathbf{X}_{2l}, \end{aligned}$$
(5)

are one- and two-sample U-statistics, respectively, with symmetric kernels as bilinear forms of independent vectors. As in the one-sample case, \(E(Q_{i1}) = {{\mathrm{tr}}}({\varvec{\Sigma }}_i)/n_i \Rightarrow E(Q_1) = {{\mathrm{tr}}}({\varvec{\Sigma }}_0)\), \({\varvec{\Sigma }}_0 = \sum _{i = 1}^2{\varvec{\Sigma }}_i/n_i\) and \(E(U_0) = \Vert {\varvec{\mu }}_1 - {\varvec{\mu }}_2\Vert ^2\) which vanishes under \(H_{02}\). Thus,

$$\begin{aligned} E(Q) = {{\mathrm{tr}}}({\varvec{\Sigma }}_0) + \Vert {\varvec{\mu }}_1 - {\varvec{\mu }}_2\Vert ^2 ~=~ {{\mathrm{tr}}}({\varvec{\Sigma }}_0)~ \text {under}~H_{02}. \end{aligned}$$
(6)

Again, \(E(Q_1)\) does not depend on the \({\varvec{\mu }}_i\), and \(E(U_0)\) does not depend on the \({\varvec{\Sigma }}_i\), under both \(H_{02}\) and \(H_{12}\). Further, \(E(Q_1) = {{\mathrm{tr}}}({\varvec{\Sigma }}_0)\) with \({\varvec{\Sigma }}_0 = {{\mathrm{Cov}}}(\overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2)\). We thus extend \(T_1\) in Eq. (3) for \(H_{02}\) as

$$\begin{aligned} T_2 = 1 + \frac{nQ_0}{[nQ_1/p]}, \end{aligned}$$
(7)

where \(Q_0 = U_0/p\) is \(U_0\) with the kernels of \(U_{n_i}\) and \(U_{n_1n_2}\) scaled by p, i.e., \(h(\mathbf{x}_{ik}, \mathbf{x}_{ir}) = \mathbf{x}'_{ik}\mathbf{x}_{ir}/p\) and \(h(\mathbf{x}_{1k}, \mathbf{x}_{2l}) = \mathbf{x}'_{1k}\mathbf{x}_{2l}/p\), respectively. The following assumptions extend those of the one-sample case, where \(\nu _{is} = \lambda _{is}/p\) are the eigenvalues of \({\varvec{\Xi }}_i = {\varvec{\Sigma }}_i/p\), \(i = 1, 2\).
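A minimal sketch of \(T_2\), assembled in the same way as the one-sample sketch above (again our illustration of Eqs. (4)–(7), with the kernels scaled by p):

```python
import numpy as np

def T2_statistic(X1, X2):
    """Two-sample statistic T2 = 1 + nQ0/(nQ1/p) of Eq. (7)."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    n = n1 + n2
    def U_one(X):                            # one-sample U-statistic U_{n_i}
        m = X.shape[0]
        G = X @ X.T
        return (G.sum() - np.trace(G)) / (m * (m - 1))
    U12 = (X1 @ X2.T).mean()                 # two-sample U-statistic U_{n1 n2}
    U0 = U_one(X1) + U_one(X2) - 2 * U12
    Q1 = sum(np.trace(np.cov(X, rowvar=False)) / X.shape[0] for X in (X1, X2))
    return 1 + (n * U0 / p) / (n * Q1 / p)
```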

Assumption 8

\(E(X^4_{iks}) = \gamma _{is} \le \gamma < \infty \forall ~s = 1, \ldots , p\), \(i = 1, \ldots , g\), \(\gamma \in {\mathbb {R}}^+\).

Assumption 9

\(\lim _{p \rightarrow \infty }\sum _{s = 1}^p\nu _{is} = \sum _{s = 1}^\infty \nu _{is} = \nu _{i0} \in {\mathbb {R}}^+\), \(i = 1, \ldots , g\).

Assumption 10

\(\lim _{n_i, p \rightarrow \infty } p/n_i = c_i = O(1)\), \(i = 1, \ldots , g\).

Assumption 11

\(\lim _{n_i \rightarrow \infty } n/n_i = \rho _i = O(1)\), \(i = 1, \ldots , g\).

Assumption 12

\(\lim _{p \rightarrow \infty } {\varvec{\mu }}'_i{\varvec{\Sigma }}_k{\varvec{\mu }}_j/p = \phi _{ijk} \le \phi = O(1)\), \(i, j, k = 1, \ldots , g\).

As the same assumptions will be used in Sect. 3, they are stated for \(g \ge 2\). Assumption 11 is additional to those of the one-sample case; it is needed to keep the limit non-degenerate when \(n_i \rightarrow \infty \), \(n = \sum _{i = 1}^gn_i\). Assumption 12 is again needed only under \(H_{12}\). The following theorem, proved in “Appendix B.2”, extends Theorem 5 to the two-sample case.

Theorem 13

For \(T_2\) in Eq. (7), \((T_2 - E(T_2))/\sqrt{{{\mathrm{Var}}}(T_2)} \xrightarrow {{\mathcal {D}}} N(0, 1)\) under Assumptions 8–12, as \(n_i, p \rightarrow \infty \), where \(E(T_2)\) and \({{\mathrm{Var}}}(T_2)\) denote the mean and variance of \(T_2\).

It is interesting to see how the limit for the degenerate case sums up. With \(\nu _0\) as the limit of \(nQ_1/p\), it follows from (21) and (22) that (see e.g., Anderson et al. 1994)

$$\begin{aligned}&nQ_0~\xrightarrow {{\mathcal {D}}}~\sum _{s = 1}^\infty \left( \sqrt{\rho _1\nu _{1s}}z_{1s} - \sqrt{\rho _2\nu _{2s}}z_{2s}\right) ^2 - \nu _0 \\&\quad \Rightarrow ~ T_2 ~\xrightarrow {{\mathcal {D}}}~ \sum _{s = 1}^\infty \left( \sqrt{\rho _1\nu _{1s}}z_{1s} - \sqrt{\rho _2\nu _{2s}}z_{2s}\right) ^2/\nu _0, \end{aligned}$$

with 1 and \(2\sum _{s = 1}^\infty (\rho _1\nu _{1s} + \rho _2\nu _{2s})^2/\nu ^2_0\) as the limiting mean and variance, where the variance approximates \(2{{\mathrm{tr}}}({\varvec{\Xi }}^2)/[{{\mathrm{tr}}}({\varvec{\Xi }})]^2\), \({\varvec{\Xi }} = n{\varvec{\Sigma }}_0/p\), \({\varvec{\Sigma }}_0 = \sum _{i = 1}^2{\varvec{\Sigma }}_i/n_i\). By the same argument of a scaled Chi-square approximation as in the one-sample case, the moments correspond to those of \(\chi ^2_f/f\), i.e., 1 and 2/f, \(f = f_1/f_2\), \(f_1 = [{{\mathrm{tr}}}({\varvec{\Xi }})]^2\), \(f_2 = {{\mathrm{tr}}}({\varvec{\Xi }}^2)\). Let \(E_{2i} = \eta _i\{(n_i - 1)(n_i - 2)\text {tr}(\widehat{\varvec{\Sigma }}^2_i) + [\text {tr}(\widehat{\varvec{\Sigma }}_i)]^2 - n_iQ_i\}\), \(E_{3i} = \eta _i\{2\text {tr}(\widehat{\varvec{\Sigma }}^2_i) + (n^2_i - 3n_i + 1)[\text {tr}(\widehat{\varvec{\Sigma }}_i)]^2 - n_iQ_i\}\), where \(Q_i = \sum _{k = 1}^{n_i}(\widetilde{\mathbf{X}}'_{ik}\widetilde{\mathbf{X}}_{ik})^2/(n_i - 1)\), \(\widetilde{\mathbf{X}}_{ik} = \mathbf{X}_{ik} - \overline{\mathbf{X}}_i\), \(\eta _i = (n_i - 1)/[n_i(n_i - 2)(n_i - 3)]\). Further, by independence, \({{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_1\widehat{\varvec{\Sigma }}_2)\) is an unbiased and consistent estimator of \({{\mathrm{tr}}}({\varvec{\Sigma }}_1{\varvec{\Sigma }}_2)\). Plugging these estimators into \(f_1\), \(f_2\) leads to a consistent estimator of f, hence of \({{\mathrm{Var}}}(T_2)\), i.e., \({\widehat{{{\mathrm{Var}}}}}(T_2)\).
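A sketch of the resulting plug-in variance estimator for \(T_2\) (reusing `trace_estimators` from the one-sample sketch above; the factors \(n/p\) in \({\varvec{\Xi }}\) again cancel in the ratio \(2f_2/f_1\)):

```python
import numpy as np
# assumes trace_estimators() from the one-sample sketch above

def var_T2_hat(X1, X2):
    """Plug-in estimator 2*f2-hat/f1-hat of Var(T2)."""
    n1, n2 = X1.shape[0], X2.shape[0]
    S1 = np.cov(X1, rowvar=False)            # p x p; use Gram forms if p is very large
    S2 = np.cov(X2, rowvar=False)
    E21, E31 = trace_estimators(X1)          # estimate tr(Sigma_1^2), [tr(Sigma_1)]^2
    E22, E32 = trace_estimators(X2)
    tr_S1S2 = np.sum(S1 * S2)                # unbiased for tr(Sigma_1 Sigma_2)
    f2 = E21 / n1**2 + E22 / n2**2 + 2 * tr_S1S2 / (n1 * n2)          # tr(Sigma_0^2)
    f1 = (E31 / n1**2 + E32 / n2**2
          + 2 * np.trace(S1) * np.trace(S2) / (n1 * n2))              # [tr(Sigma_0)]^2
    return 2 * f2 / f1
```

We have the following corollary.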

Corollary 14

Theorem 13 remains valid when \({{\mathrm{Var}}}(T_2)\) is replaced with \({\widehat{{{\mathrm{Var}}}}}(T_2)\).

Remark 15

Due to its special practical value, the two-sample test has been investigated the most, also in the high-dimensional case. We briefly discuss three tests most closely related to \(T_2\). Denote \(\kappa = n_1n_2/n\), \(\omega _1 = (n - 1)/(n - 2)\), \(\omega _2 = (n - 2)^2/[n(n - 1)]\), \(n = n_1 + n_2\). Let \(\xi = \Vert \overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2\Vert ^2 - {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})/\kappa \), where \(\widehat{\varvec{\Sigma }}\) is the pooled estimator of the common \({\varvec{\Sigma }}\) as given in the context of \(T^2\) above.

Dempster (1958) proposed the first two-sample test for high-dimensional data under normality, motivated by a problem put forth by his colleagues (see Sect. 5). The test, in simpler form, is given as \(T_D = \Vert \overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2\Vert ^2/\kappa {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})\). An alternative form of \(T_D\) follows by partitioning the norm in the numerator into several independent quadratic forms using an orthonormal transformation, so that the test follows an approximate F distribution with degrees of freedom estimated using a scaled Chi-square distribution. See also Dempster (1960, 1968) for details; Bai and Saranadasa (1996) give a detailed evaluation of the approximation and power of Dempster’s test.

Bai and Saranadasa (1996)’s test, \(T_\text {BS} = \kappa \xi /\sqrt{2\omega _1}B\), is a standardization of \(\xi \) under homoscedasticity, where \(B^2 = \omega _2\{{{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}^2) - [{{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})]^2/n\}\). Chen and Qin (2010)’s test, \(T_\text {CQ}\), is a standardization of \(U_0 = \sum _{i = 1}^2U_{n_i} - 2U_{n_1n_2}\); see (4). \(T_\text {CQ}\) is based on the same model used for \(T_\text {BS}\) but relaxes normality and homoscedasticity. From the partition of Q in (4), it follows that, under the assumption of homoscedasticity, \(T_\text {D}\) divides the norm by the bias term, whereas \(T_\text {BS}\) and \(T_\text {CQ}\) subtract the same bias term from the norm, so that the numerator in both tests is \(U_0\) with \(E(U_0) = \Vert {\varvec{\mu }}_1 - {\varvec{\mu }}_2\Vert ^2 = 0\) under \(H_0\); for \({\varvec{\Sigma }}_i = {\varvec{\Sigma }}\), \(i = 1, 2\), the two tests coincide.

The proposed test, \(T_g\), \(g \ge 1\), differs from both in that it uses the removed bias term to rescale the statistic, and it requires neither the normality nor the homoscedasticity assumption. Note that \(T_\text {CQ}\) is also defined without the two assumptions, but the bias adjustment, the assumptions and the computation of the variance of the statistic differ substantially between the two tests.

To get a more precise idea of the comparison of these tests, we conducted a simulation study to assess their sizes and power. Two independent random samples of iid vectors of sizes (\(n_1\), \(n_2\)), \(n_1 \in \{10, 20, 50\}\), \(n_2 = 2n_1\), each of dimension \(p \in \{50, 100, 300, 500\}\), are generated from normal, \(t_7\) and Unif[0, 1] distributions with covariance matrices \({\varvec{\Sigma }}_i\), \(i = 1, 2\), of compound symmetry (CS) and first-order autoregressive (AR(1)) type. The CS and AR(1) structures are defined, respectively, as \(\kappa \mathbf{I} + \rho \mathbf{J}\) and \(\text {Cov}(X_k, X_l) = \kappa \rho ^{|k - l|}\), \(\forall ~ k, l\), with \(\mathbf{I}\) the identity matrix and \(\mathbf{J}\) a matrix of 1s. For size, we pair \({\varvec{\Sigma }}_i\) for the two populations: both \({\varvec{\Sigma }}_1\) and \({\varvec{\Sigma }}_2\) CS with \(\rho = 0.5\) and \(\rho = 0.8\), respectively; \({\varvec{\Sigma }}_1\) as CS, \({\varvec{\Sigma }}_2\) as AR(1), both with \(\rho = 0.5\). For power, we use CS with \(\rho \) = 0.4 and 0.8. We take \(\kappa = 1\) for all cases. For brevity, power results are only reported for p = 100, for the normal and t distributions.
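The covariance structures used here can be generated as follows (a sketch for the normal case, drawing samples via a Cholesky factor; the parameter values are those of the size study):

```python
import numpy as np

def cs_cov(p, rho, kappa=1.0):
    """Compound symmetry: kappa*I + rho*J."""
    return kappa * np.eye(p) + rho * np.ones((p, p))

def ar1_cov(p, rho, kappa=1.0):
    """AR(1): Cov(X_k, X_l) = kappa * rho**|k - l|."""
    idx = np.arange(p)
    return kappa * rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(1)
p, n1 = 100, 20
L1 = np.linalg.cholesky(cs_cov(p, 0.5))       # Sigma_1: CS, rho = 0.5
L2 = np.linalg.cholesky(ar1_cov(p, 0.5))      # Sigma_2: AR(1), rho = 0.5
X1 = rng.standard_normal((n1, p)) @ L1.T      # sample 1
X2 = rng.standard_normal((2 * n1, p)) @ L2.T  # sample 2, n2 = 2*n1
```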

Table 1 reports the estimated sizes of \(T_2\), \(T_\text {BS}\) and \(T_\text {CQ}\) for all distributions with both pairs of \({\varvec{\Sigma }}_i\). We observe an accurate performance of \(T_2\) for all parameters, whereas \(T_\text {BS}\) and \(T_\text {CQ}\) prove, respectively, to be very liberal and very conservative, with their performance not improving with increasing p or (particularly) increasing n. The inaccuracy of \(T_\text {BS}\) can be explained by the homoscedasticity assumption on which the test is based and which is violated in the simulations. The performance of \(T_\text {CQ}\), on the other hand, can be ascribed to its assumptions, particularly the vanishing of trace ratios such as \({{\mathrm{tr}}}({\varvec{\Sigma }}^4)/[{{\mathrm{tr}}}({\varvec{\Sigma }}^2)]^2\), \({{\mathrm{tr}}}({\varvec{\Sigma }}^2)/[{{\mathrm{tr}}}({\varvec{\Sigma }})]^2\) and \({{\mathrm{tr}}}({\varvec{\Sigma }}^3)/{{\mathrm{tr}}}({\varvec{\Sigma }}){{\mathrm{tr}}}({\varvec{\Sigma }}^2)\), which are not satisfied for certain covariance structures, e.g., compound symmetry. A discussion of \(T_g\) is deferred to Sect. 4, where it is evaluated in more detail.

Table 1 Estimated test size for \(T_2\), \(T_\text {BS}\) and \(T_\text {CQ}\) for three distributions with unequal covariance matrices
Fig. 1 Power curves of \(T_2\), \(T_{\text {BS}}\) and \(T_{\text {CQ}}\) for normal (upper) and t (lower) distributions with (left to right) \((n_1, n_2) = (10, 20), (20, 40), (50, 100)\), \(p = 100\) and CS structures with \(\rho = 0.4\) and 0.8

From Fig. 1, we also observe that the power of \(T_2\) is higher than that of its competitors, with the curves drawing closer as the non-centrality parameter and the sample sizes increase; this pattern is very similar for both distributions. Generally, a similar comparative performance and effect of sample sizes are observed for other values of p; hence, not all results are reported here.

3 Multi-sample test: one-way MANOVA

Here, we extend \(T_2\) to the general case, \(g \ge 2\). As usual, \(\overline{\mathbf{X}}_i\) and \(\widehat{\varvec{\Sigma }}_i\) are the unbiased estimators of \({\varvec{\mu }}_i\), \({\varvec{\Sigma }}_i\), \(i = 1, \ldots , g\). Recall \(T_2\) in (7) as a modification of \(T^2\) using the Euclidean distance \(\Vert \overline{\mathbf{X}}_1 - \overline{\mathbf{X}}_2\Vert ^2\). For \(H_{0g}\), we sum over all pairwise norms, \(\sum _{i< j}\Vert \overline{\mathbf{X}}_i - \overline{\mathbf{X}}_j\Vert ^2 = \sum _{i< j}[(E_i - U_{n_i})/n_i + (E_j - U_{n_j})/n_j] + \sum _{i< j}(U_{n_i} + U_{n_j} - 2U_{n_in_j}) = (g - 1)\sum _{i = 1}^g{{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_i)/n_i + (g - 1)\sum _{i = 1}^gU_{n_i} - 2\sum _{i < j}^gU_{n_in_j}\), and define the MANOVA statistic as

$$\begin{aligned} T_g = (g - 1) + \frac{nQ_0}{[nQ_1/p]}, \end{aligned}$$
(8)

\(Q_1 = \sum _{i = 1}^gQ_{i1}\), \(Q_{i1} = (E_i - U_{n_i})/n_i = {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_i)/n_i\), \(E_i = \sum _{k = 1}^{n_i}\mathbf{X}'_{ik}\mathbf{X}_{ik}/n_i\), \(Q_0 = \sum _{i < j}Q_{0ij}\), \(Q_{0ij} = U_{n_i} + U_{n_j} - 2U_{n_in_j}\), where \(U_{n_i}\), \(U_{n_in_j}\) are as defined in (5) with kernels scaled by p, \(h(\mathbf{x}_{ik}, \mathbf{x}_{ir}) = \mathbf{x}'_{ik}\mathbf{x}_{ir}/p\), \(k \ne r\), \(h(\mathbf{x}_{ik}, \mathbf{x}_{jl}) = \mathbf{x}'_{ik}\mathbf{x}_{jl}/p\), \(i \ne j\), \(k, r = 1, \ldots , n_i\), \(l = 1, \ldots , n_j\), \(i, j = 1, \ldots , g\), \(n = \sum _{i = 1}^gn_i\). Further, \(Q_1 = {{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_0)\), \(\widehat{\varvec{\Sigma }}_0 = \sum _{i = 1}^g\widehat{\varvec{\Sigma }}_i/n_i\), is an unbiased estimator of \({{\mathrm{tr}}}({\varvec{\Sigma }}_0)\), \({\varvec{\Sigma }}_0 = \sum _{i = 1}^g{\varvec{\Sigma }}_i/n_i\).
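A direct transcription of Eq. (8) (our sketch; the pairwise U-statistics are computed by brute force, which is adequate for moderate g and \(n_i\)):

```python
import numpy as np

def Tg_statistic(samples):
    """Multi-sample statistic T_g = (g-1) + nQ0/(nQ1/p) of Eq. (8)."""
    g, p = len(samples), samples[0].shape[1]
    n = sum(X.shape[0] for X in samples)
    def U_one(X):                            # one-sample U-statistic U_{n_i}
        m = X.shape[0]
        G = X @ X.T
        return (G.sum() - np.trace(G)) / (m * (m - 1))
    Q1 = sum(np.trace(np.cov(X, rowvar=False)) / X.shape[0] for X in samples)
    Q0 = 0.0
    for i in range(g):
        for j in range(i + 1, g):            # Q_{0ij} = U_{n_i} + U_{n_j} - 2U_{n_i n_j}
            Uij = (samples[i] @ samples[j].T).mean()
            Q0 += U_one(samples[i]) + U_one(samples[j]) - 2 * Uij
    Q0 /= p                                  # kernels scaled by p
    return (g - 1) + (n * Q0) / (n * Q1 / p)
```

We begin with the moments of \(Q_0\). In particular,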

$$\begin{aligned} {{\mathrm{Var}}}(Q_0)= & {} \underset{i \ne j}{\sum _{i = 1}^g\sum _{j = 1}^g{{\mathrm{Var}}}}(Q_{0ij}) + \underset{(i, j) \ne (i', j')}{\sum _{i = 1}^g\sum _{i' = 1}^g\sum _{j = 1}^g\sum _{j' = 1}^g}{{\mathrm{Cov}}}(Q_{0ij}, Q_{0i'j'})\nonumber \\= & {} (g - 1)^2\sum _{i = 1}^g{{\mathrm{Var}}}(U_{n_i}) + 4\underset{i< j}{\sum _{i = 1}^g\sum _{j = 1}^g}{{\mathrm{Var}}}(U_{n_in_j}) \nonumber \\&+~8\underset{i< j< j'}{\sum _{i = 1}^g\sum _{j = 1}^g\sum _{j' = 1}^g}{{\mathrm{Cov}}}(U_{n_in_j}, U_{n_in_{j'}})\nonumber \\&+ 8\underset{i< i'< j'}{\sum _{i = 1}^g\sum _{j = 1}^g\sum _{j' = 1}^g}{{\mathrm{Cov}}}(U_{n_in_j}, U_{n_{i'}n_j}) \nonumber \\&+~8\underset{i< i'< j'}{\sum _{i = 1}^g\sum _{j = 1}^g\sum _{j' = 1}^g}{{\mathrm{Cov}}}(U_{n_in_j}, U_{n_{i'}n_{j'}}) \nonumber \\&-\, 4(g - 1)\underset{i< j< j'}{\sum _{i = 1}^g\sum _{j = 1}^g\sum _{j' = 1}^g}{{\mathrm{Cov}}}(U_{n_i}, U_{n_in_j})\nonumber \\&-~4(g - 1)\underset{i< j< j'}{\sum _{i = 1}^g\sum _{j = 1}^g\sum _{j' = 1}^g}{{\mathrm{Cov}}}(U_{n_j}, U_{n_in_j})\nonumber \\&-\, 4(g - 1)\underset{i< j < j'}{\sum _{i = 1}^g\sum _{j = 1}^g\sum _{j' = 1}^g}{{\mathrm{Cov}}}(U_{n_i}, U_{n_jn_{j'}}) \end{aligned}$$

where the covariances vanish when \(i \ne i', j \ne j'\). Denoting \({\varvec{\Sigma }}_{0ij} = {\varvec{\Sigma }}_i/n_i + {\varvec{\Sigma }}_j/n_j\), \(i < j\), and using the moments of U-statistics from Sect. A.2, we obtain

$$\begin{aligned} {{\mathrm{Var}}}(Q_{0ij})= & {} \frac{2}{p^2}{{\mathrm{tr}}}({\varvec{\Sigma }}^2_{0ij}) + \frac{4}{p^2}({\varvec{\mu }}_i - {\varvec{\mu }}_j)'{\varvec{\Sigma }}_{0ij}({\varvec{\mu }}_i - {\varvec{\mu }}_j) \end{aligned}$$
(9)
$$\begin{aligned} {{\mathrm{Cov}}}(Q_{0ij}, Q_{0ij'})= & {} \frac{2}{n^2_ip^2}{{\mathrm{tr}}}({\varvec{\Sigma }}^2_i) + \frac{4}{n_ip^2}({\varvec{\mu }}_i - {\varvec{\mu }}_j)'{\varvec{\Sigma }}_i({\varvec{\mu }}_i - {\varvec{\mu }}_{j'}) \end{aligned}$$
(10)
$$\begin{aligned} {{\mathrm{Cov}}}(Q_{0ij}, Q_{0i'j})= & {} \frac{2}{n^2_jp^2}{{\mathrm{tr}}}({\varvec{\Sigma }}^2_j) + \frac{4}{n_jp^2}({\varvec{\mu }}_i - {\varvec{\mu }}_j)'{\varvec{\Sigma }}_j({\varvec{\mu }}_{i'} - {\varvec{\mu }}_j). \end{aligned}$$
(11)

Theorem 16 summarizes the moments, which reduce to those of the two-sample case for \(g = 2\).

Theorem 16

For \(Q_0\) defined above, we have

$$\begin{aligned} E(Q_0)= & {} \frac{1}{p}\underset{i< j}{\sum _{i = 1}^g\sum _{j = 1}^g}\Vert {\varvec{\mu }}_i - {\varvec{\mu }}_j\Vert ^2\\ {{\mathrm{Var}}}(Q_0)= & {} \frac{1}{p^2}\left[ 2(g - 1)^2\sum _{i = 1}^g\frac{{{\mathrm{tr}}}({\varvec{\Sigma }}^2_i)}{n^2_i} + 4\underset{i< j}{\sum _{i = 1}^g\sum _{j = 1}^g}\frac{{{\mathrm{tr}}}({\varvec{\Sigma }}_i{\varvec{\Sigma }}_j)}{n_in_j}\right. \\&+~4\left. \underset{i< j}{\sum _{i = 1}^g\sum _{j = 1}^g}({\varvec{\mu }}_i - {\varvec{\mu }}_j)'{\varvec{\Sigma }}_{0ij}({\varvec{\mu }}_i - {\varvec{\mu }}_j) \right. \\&\left. + 4\underset{i< j < j'}{\sum _{i = 1}^g\sum _{j = 1}^g\sum _{j' = 1}^g}({\varvec{\mu }}_i - {\varvec{\mu }}_j)'{\varvec{\Sigma }}_{0ij}({\varvec{\mu }}_i - {\varvec{\mu }}_{j'})\right] \end{aligned}$$

Now, consider the limit of \(T_g\) under Assumptions 8–12. By the independence of the g samples, the convergence of \(Q_1\) follows exactly as for \(g = 2\), so that, as \(n_i, p \rightarrow \infty \),

$$\begin{aligned} nQ_1/p ~\xrightarrow {{\mathcal {P}}}~ \nu _0, \end{aligned}$$

where \(\nu _0 = \sum _{i = 1}^g\rho _i\nu _{i0} = \sum _{i = 1}^g\sum _{s = 1}^\infty \rho _i\nu _{is}\). For the limit of \(Q_0\), we note, from the formulation \((g - 1)\sum _{i = 1}^gU_{n_i} - 2\sum _{i < j}^gU_{n_in_j}\) and the independence of \(U_{n_i}\), \(U_{n_j}\), \(i \ne j\), that we need the distribution of \(\mathbf{U}_N = (U_{n_i}, U_{n_in_{i + 1}}, \ldots , U_{n_in_g})'\), \(i = 1, \ldots , g - 1\). Alternatively, we can consider \(\mathbf{Q}_0 = (Q_{012}, \ldots , Q_{0g-1,g})'\). For \(g = 2\), \(\mathbf{U}_N = (U_{n_1}, ~U_{n_1n_2})'\), \(\mathbf{Q}_0 = Q_{012}\). We can use either of the two options and proceed as for \(g = 2\). \(\mathbf{Q}_0\) is a \(G\times 1\) vector, \(G = g(g - 1)/2\), with \({{\mathrm{Cov}}}(\mathbf{Q}_0)\) a \(G\times G\) partitioned matrix \({\varvec{\Lambda }} = ({\varvec{\Lambda }}_{ij}/p^2)\) where

$$\begin{aligned} {\varvec{\Lambda }} ~=~ \begin{pmatrix} {\varvec{\Lambda }}_{11} &{}\quad {\varvec{\Lambda }}_{12} &{}\quad \ldots &{}\quad {\varvec{\Lambda }}_{1g}\\ {\varvec{\Lambda }}_{21} &{}\quad {\varvec{\Lambda }}_{22} &{}\quad \ldots &{}\quad {\varvec{\Lambda }}_{2g}\\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ {\varvec{\Lambda }}_{g1} &{}\quad {\varvec{\Lambda }}_{g2} &{}\quad \ldots &{}\quad {\varvec{\Lambda }}_{gg} \end{pmatrix}. \end{aligned}$$
(12)

Thus, \({\varvec{\Lambda }}_{ii}/p^2 = {{\mathrm{Cov}}}(\mathbf{Q}_{0i})\): \((g - i)\times (g - i)\), and \({\varvec{\Lambda }}_{ij}/p^2 = {{\mathrm{Cov}}}(\mathbf{Q}_{0i}, \mathbf{Q}_{0j})\): \((g - i)\times (g - j)\), \({\varvec{\Lambda }}_{ji} = {\varvec{\Lambda }}'_{ij}\), \(i = 1, \ldots , g-1\), \(j = i+1, \ldots , g\). Denote \(a_i = {{\mathrm{tr}}}({\varvec{\Sigma }}^2_i/n^2_i)\), \(a_{0ij} = {{\mathrm{tr}}}({\varvec{\Sigma }}^2_{0ij})\). Then \({\varvec{\Lambda }}_{ii} = 2(\oplus _{j = i + 1}^ga_{0ij} + (\mathbf{J} - \mathbf{I})_{g-i}a_i)/p^2\), \({\varvec{\Lambda }}_{ij} = 2(\mathbf{0}' ~~ \mathbf{1}'_{g-i}a_i ~~ \oplus _{j = i+2}^ga_j)'/p^2\), where \(\mathbf{1}\) is vector of 1s, \(\mathbf{J} = \mathbf{11}'\), \(\mathbf{I}\) is identity matrix, \(\oplus \) is Kronecker sum and \(\mathbf{0}\) in \({\varvec{\Lambda }}_{ij}\) is \((j-i-1)\times (g-j)\) with no \(\mathbf{0}\) if \(j - i - 1 = 0\). For any i, \({\varvec{\Lambda }}_{ii}\) has same off-diagonal element \(a_i\) with diagonal elements \(a_{0ij} = {{\mathrm{tr}}}({\varvec{\Sigma }}^2_{0ij})\), \({\varvec{\Sigma }}_{0ij} = {\varvec{\Sigma }}_i/n_i + {\varvec{\Sigma }}_j/n_j = {{\mathrm{Cov}}}(\overline{\mathbf{X}}_i - \overline{\mathbf{X}}_j)\), \(j = i +1\). Further, most off-diagonals in \({\varvec{\Lambda }}_{ij}\) are 0, and the number of (rows with) zeros increases with j for every i, making \({\varvec{\Lambda }}\) an increasingly sparse matrix.

The weak convergence holds for \(Q_{0ij}\) for any (i, j) in \(\mathbf{Q}_0\), and we only need to take care of the nonzero off-diagonal elements in \({\varvec{\Lambda }}\), i.e., \(a_i/p^2\), which are uniformly bounded under the assumptions; the same holds for Eqs. (9)–(11). The limit of \(n\mathbf{Q}_0\), hence of \(nQ_0\), then follows as that of \(\mathbf{U}_N\) for \(g = 2\). Finally, Slutsky’s lemma gives the limit of \(T_g\). For the limit under \(H_{0g}\), \(E(\mathbf{Q}_0) = \mathbf{0}\), all covariances of U-statistics vanish (Sect. A.2) and Eqs. (9)–(11) reduce to \({{\mathrm{Var}}}(Q_{0ij}) = 2{{\mathrm{tr}}}({\varvec{\Sigma }}^2_{0ij})/p^2\), \({{\mathrm{Cov}}}(Q_{0ij}, Q_{0ij'}) = 2{{\mathrm{tr}}}({\varvec{\Sigma }}^2_i)/n^2_ip^2\), \({{\mathrm{Cov}}}(Q_{0ij}, Q_{0i'j}) = 2{{\mathrm{tr}}}({\varvec{\Sigma }}^2_j)/n^2_jp^2\), which do not depend on the \({\varvec{\mu }}_i\), so that we may continue to assume \({\varvec{\mu }}_i = \mathbf{0}\) \(\forall ~i\). In particular, from Theorem 16, \(E(Q_0) = 0\) under \(H_{0g}\) and

$$\begin{aligned} {{\mathrm{Var}}}(Q_0) = \frac{1}{p^2}\left[ 2(g - 1)^2\sum _{i = 1}^g\frac{{{\mathrm{tr}}}({\varvec{\Sigma }}^2_i)}{n^2_i} + 4\underset{i < j}{\sum _{i = 1}^g\sum _{j = 1}^g}\frac{{{\mathrm{tr}}}({\varvec{\Sigma }}_i{\varvec{\Sigma }}_j)}{n_in_j}\right] , \end{aligned}$$

which is \(2{{\mathrm{tr}}}({\varvec{\Sigma }}^2_{012})/p^2\) for \(g = 2\); see Eq. (19). The null limit then also follows along the same lines as for \(g = 2\). The following theorem generalizes Theorem 13 to \(g \ge 2\) samples.

Theorem 17

For \(T_g\) in Eq. (8), \((T_g - E(T_g))/\sqrt{{{\mathrm{Var}}}(T_g)}\xrightarrow {{\mathcal {D}}} N(0, 1)\) under Assumptions 8–12, as \(n_i, p \rightarrow \infty \), where \(E(T_g)\) and \({{\mathrm{Var}}}(T_g)\) denote the mean and variance of \(T_g\).

For the moments of \(T_g\), note that the general distribution follows from the projection \({\widehat{Q}}_0 = \sum _{i< j}^g{\widehat{Q}}_{0ij} = \sum _{i < j}^g({\widehat{U}}_{n_i} - 2{\widehat{U}}_{n_in_j})\) of \({\widetilde{Q}}_0 = \sum _{i < j}^g{\widetilde{Q}}_{0ij}\), \({\widetilde{Q}}_{0ij} = Q_{0ij} - E(Q_{0ij})\), so that

$$\begin{aligned} E({\widehat{Q}}_0)= & {} \frac{1}{p}\underset{i< j}{\sum _{i = 1}^g\sum _{j = 1}^g}\Vert {\varvec{\mu }}_i - {\varvec{\mu }}_j\Vert ^2 \\ {{\mathrm{Var}}}({\widehat{Q}}_0)= & {} \frac{4}{p^2}\left[ \underset{i< j}{\sum _{i = 1}^g\sum _{j = 1}^g}({\varvec{\mu }}_i - {\varvec{\mu }}_j)'{\varvec{\Sigma }}_{0ij}({\varvec{\mu }}_i - {\varvec{\mu }}_j)\right. \\&\left. + 4\underset{i< j < j'}{\sum _{i = 1}^g\sum _{j = 1}^g\sum _{j' = 1}^g}({\varvec{\mu }}_i - {\varvec{\mu }}_j)'{\varvec{\Sigma }}_{0ij}({\varvec{\mu }}_i - {\varvec{\mu }}_{j'})\right] . \end{aligned}$$

Likewise, under \(H_{0g}\), the convergence of degenerate \(U_{n_i}\) and \(U_{n_in_j}\) gives

$$\begin{aligned} nQ_0 \xrightarrow {{\mathcal {D}}} \underset{i < j}{\sum _{i = 1}^g\sum _{j = 1}^g}\sum _{s = 1}^\infty \left( \sqrt{\rho _i\nu _{is}}z_{is} - \sqrt{\rho _j\nu _{js}}z_{js}\right) ^2 \end{aligned}$$

such that the limiting moments \(E(nQ_0) = 0\) and \({{\mathrm{Var}}}(nQ_0) = 2\sum _{i < j}^g\sum _{s = 1}^\infty (\rho _i\nu _{is} + \rho _j\nu _{js})^2\) approximate the exact moments of \(Q_0\) under \(H_{0g}\). Combined with the limit of \(nQ_1/p\), this gives

$$\begin{aligned} T_g ~\xrightarrow {{\mathcal {D}}}~ \frac{1}{\nu _0}\underset{i < j}{\sum _{i = 1}^g\sum _{j = 1}^g}\sum _{s = 1}^\infty \left( \sqrt{\rho _i\nu _{is}}z_{is} - \sqrt{\rho _j\nu _{js}}z_{js}\right) ^2, \end{aligned}$$
(13)

with \(E(T_g) = g - 1\) and variance \({{\mathrm{Var}}}(T_g) = 2\sum _{i < j}^g\sum _{s = 1}^\infty (\rho _i\nu _{is} + \rho _j\nu _{js})^2/\nu ^2_0\), which approximates \(2{{\mathrm{tr}}}({\varvec{\Xi }}^2)/[{{\mathrm{tr}}}({\varvec{\Xi }})]^2\), \({\varvec{\Xi }} = n{\varvec{\Sigma }}_0/p\), \({\varvec{\Sigma }}_0 = \sum _{i = 1}^g{\varvec{\Sigma }}_i/n_i\). Further, \(z_{ijs} = \sqrt{\rho _i\nu _{is}}z_{is} - \sqrt{\rho _j\nu _{js}}z_{js}\) is a linear combination of independent N(0, 1) variables, hence itself normal with mean 0 and variance \(\rho _i\nu _{is} + \rho _j\nu _{js}\). To estimate \({{\mathrm{Var}}}(T_g)\), we note that the set of distinct non-zero elements in \({\varvec{\Lambda }}\) is

$$\begin{aligned} {\mathcal {S}} = \left\{ a_i = {{\mathrm{tr}}}({\varvec{\Sigma }}^2_i),~a_{ij} = {{\mathrm{tr}}}({\varvec{\Sigma }_i}{\varvec{\Sigma }}_j),~i, j = 1, \ldots , g,~i < j\right\} , \end{aligned}$$
(14)

with cardinality \(s_0 = \#\{{\mathcal {S}}\} = g(g + 1)/2\), i.e., for any g, we only need to estimate \(s_0\) elements out of \(G(G + 1)/2\) in order to estimate \({\varvec{\Lambda }}\). With the estimators of \({{\mathrm{tr}}}({\varvec{\Sigma }}^2_i)\), \([{{\mathrm{tr}}}({\varvec{\Sigma }}_i)]^2\) and \({{\mathrm{tr}}}({\varvec{\Sigma }}_i{\varvec{\Sigma }}_j)\) the same as given in the two-sample case, a consistent plug-in estimator of \({{\mathrm{Var}}}(T_g)\) follows, leading to the following generalization of Corollary 14.

Corollary 18

Theorem 17 remains valid when \({{\mathrm{Var}}}(T_g)\) is replaced with \({\widehat{{{\mathrm{Var}}}}}(T_g)\).

Power of \(T_g\) For \(z_{\alpha }\) as before, \(\text {P}(T_g \ge z_{\alpha }\sqrt{{{\mathrm{Var}}}(T_g)} + (g - 1)) = \alpha \), so that \(1 - \beta = \text {P}(z_g \ge z_\alpha - n\delta )\), where, with \(z_g = (T_g - E(T_g))/\sqrt{{{\mathrm{Var}}}(T_g)}\), \(\delta = \delta _1/\delta _2\), \(\delta _1 = \sum _{i < j}^g\Vert {\varvec{\mu }}_i - {\varvec{\mu }}_j\Vert ^2/p\), \(\delta ^2_2 = {{\mathrm{tr}}}({\varvec{\Xi }}_0)\), \({\varvec{\Xi }}_0 = n{\varvec{\Sigma }}_0/p\), \({\varvec{\Sigma }}_0 = \sum _{i = 1}^g{\varvec{\Sigma }}_i/n_i\). For \(g = 2\), \(\delta _1 = \Vert {\varvec{\mu }}_1 - {\varvec{\mu }}_2\Vert ^2/p\), \({\varvec{\Sigma }}_0 = \sum _{i = 1}^2{\varvec{\Sigma }}_i/n_i\). A case of particular interest is when the \({\varvec{\mu }}_i\) are mutually orthogonal, \({\varvec{\mu }}'_i{\varvec{\mu }}_j = 0\) \(\forall ~i < j\). The power function remains the same, now with \(\delta _1 = (g - 1)\sum _{i = 1}^g\Vert {\varvec{\mu }}_i\Vert ^2/p\) or, for \(g = 2\), \(\delta _1 = (\Vert {\varvec{\mu }}_1\Vert ^2 + \Vert {\varvec{\mu }}_2\Vert ^2)/p\).

Remark 19

This remark pertains to the trace estimators used to define consistent estimators of \({{\mathrm{Var}}}(T_g)\). Consider the one-sample case, where \(E_2\), \(E_3\), the estimators of \({{\mathrm{tr}}}({\varvec{\Sigma }}^2)\), \([{{\mathrm{tr}}}({\varvec{\Sigma }})]^2\) given after Theorem 5, are defined as functions of \(\widehat{\varvec{\Sigma }}\) to keep them simple in formulation and efficient in computation. Alternatively, however, the same estimators can be defined as U-statistics, which helps study their properties, particularly consistency, more conveniently. Let \(\mathbf{D}_{kr} = \mathbf{X}_k - \mathbf{X}_r\), \(k \ne r\), and define \(A_{kr} = \mathbf{D}'_{kr}\mathbf{D}_{kr}\), \(A^2_{krls} = (\mathbf{D}'_{kr}\mathbf{D}_{ls})^2\). Then, we can equivalently write

$$\begin{aligned} E_2 = \frac{1}{P(n)}\underset{\pi (k,~ r,~ l,~ s)}{\sum _{k = 1}^{n}\sum _{r = 1}^{n}\sum _{l = 1}^{n}\sum _{s = 1}^{n}}\frac{1}{12}B_{krls}, \quad E_3 = \frac{1}{P(n)}\underset{\pi (k,~ r,~ l,~ s)}{\sum _{k = 1}^{n}\sum _{r = 1}^{n}\sum _{l = 1}^{n}\sum _{s = 1}^{n}}\frac{1}{12}C_{krls}. \end{aligned}$$

where \(B_{krls} = A^2_{krls} + A^2_{klrs} + A^2_{ksrl}\), \(C_{krls} = A_{kr}A_{ls} + A_{kl}A_{rs} + A_{ks}A_{lr}\), \(\pi (\cdot )\) means that all indices are pairwise unequal, and \(P(n) = n(n - 1)(n - 2)(n - 3)\). This formulation makes \(E_2\), \(E_3\) easily amenable to the theory of U-statistics. For details, see Ahmad (2016). The form extends directly to the multi-sample cases by defining \(E_{2i}\), \(E_{3i}\) for the ith independent sample in the same way, with \({{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_i){{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_j)\) and \({{\mathrm{tr}}}(\widehat{\varvec{\Sigma }}_i\widehat{\varvec{\Sigma }}_j)\) estimating \({{\mathrm{tr}}}({\varvec{\Sigma }}_i){{\mathrm{tr}}}({\varvec{\Sigma }}_j)\) and \({{\mathrm{tr}}}({\varvec{\Sigma }}_i{\varvec{\Sigma }}_j)\) as usual, where a U-statistic form of \({{\mathrm{tr}}}(\widehat{\varvec{\Sigma }})\) is \(\sum _{k \ne r}^nA_{kr}/n(n - 1)\). For details, see Ahmad (2017a, b).
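For small n, the U-statistic forms can be transcribed directly (an \(O(n^4)\) brute-force sketch, useful mainly for cross-checking the matrix-form \(E_2\), \(E_3\) given after Theorem 5):

```python
import numpy as np
from itertools import permutations

def E2_E3_ustat(X):
    """U-statistic forms of E2 and E3 from Remark 19; brute force, small n only."""
    n = X.shape[0]
    D = X[:, None, :] - X[None, :, :]                 # D[k, r] = X_k - X_r
    A = np.einsum('krp,lsp->krls', D, D)              # A[k, r, l, s] = D_kr' D_ls
    E2 = E3 = 0.0
    for k, r, l, s in permutations(range(n), 4):      # pi(k, r, l, s): pairwise unequal
        B = A[k, r, l, s]**2 + A[k, l, r, s]**2 + A[k, s, r, l]**2
        C = (A[k, r, k, r] * A[l, s, l, s] + A[k, l, k, l] * A[r, s, r, s]
             + A[k, s, k, s] * A[l, r, l, r])
        E2 += B / 12
        E3 += C / 12
    P = n * (n - 1) * (n - 2) * (n - 3)               # P(n)
    return E2 / P, E3 / P
```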

Remark 20

Note that the Chi-square approximation in both the one- and multi-sample cases follows through a two-moment approximation of the limit of the test statistics by that of a scaled Chi-square variable. Box (1954a, b) used this approximation to study violations of the assumptions of homoscedastic and uncorrelated errors in ANOVA settings, later extended and modified by Geisser and Greenhouse (1958), Greenhouse and Geisser (1959) and Huynh and Feldt (1970, 1976).

4 Simulations

We evaluate the accuracy of the tests for size control and power, specifically focusing on violations of the normality and homoscedasticity assumptions. We take g = 1 and 3 and generate data from normal, exponential and uniform distributions with \(n = 10, 20, 50\) for \(T_1\) and \((n_1, n_2, n_3) = (10, 15, 20), (5, 25, 50), (10, 30, 60)\) for \(T_3\), where the last two triplets represent severely unbalanced designs. For the dimension, we take \(p \in \{50, 100, 300, 500, 1000\}\). For covariance structures, we use compound symmetry (CS) and autoregressive of order 1, AR(1), as defined in Sect. 2.2, and unstructured (UN), defined as \({\varvec{\Sigma }} = (\sigma _{ij})_{i, j = 1}^p\) with \(\sigma _{ii} = 1(1)p\) and \(\sigma _{ij} = (i - 1)/p\), \(i > j\). We use \(\rho = 0.5\), \(\kappa = 1\).

We use \(\alpha = 0.01, 0.05, 0.10\) and estimate the test size as the proportion of rejections of \(H_0\) over 1000 simulation runs, where T denotes \(T_1\) or \(T_3\). Tables 2 and 3 report the estimated size and power of \(T_1\) for the normal and exponential distributions, and Tables 4 and 5 report the same for \(T_3\) for all distributions. For power, we fix \(\alpha = 0.05\) and estimate the power as the proportion of rejections under \(H_1\) over 1000 runs, where \(H_1\) is defined as \({\varvec{\mu }} = \delta _r\mathbf{p}_1\), \(\mathbf{p}_1 = (1/p, 2/p, \ldots , p/p)'\), \(\delta _r = 0.2(0.2)1\). Note that \(T_3\) is assessed under a triplet of covariance structures, with the three populations following CS, AR and UN, respectively.

We observe accurate size control for the normal as well as the non-normal distributions and under all covariance structures. The stability of the size control for increasing p, for n as small as 10, is also evident. We observe a similar performance for power, with discernibly better performance under the AR and UN structures than under CS, for all distributions, which might be attributed to the spiky nature of CS. The power, however, also improves reasonably under CS for increasing n and p. For \(g = 3\), we also observe accuracy for the unbalanced designs, with a drastic improvement for the last triplet of \(n_i\). Although not reported here, similar results were observed for other \(\rho \) values in CS and AR, for other covariance structures, e.g., Toeplitz, and for other distributions, e.g., t.

We also assessed the power of the proposed tests under possible sparse alternatives. For simplicity, we report results for \(T_1\) for the normal distribution with the same n as used above and \(p \in \{60, 100, 200\}\). We consider three levels of sparsity, small, medium and large, with 25%, 50% and 75% zeros in the mean vector, respectively. Note that 0% sparsity corresponds to the case under \(H_1\), whereas 100% sparsity corresponds to the null case. Table 6 reports the results. Generally, the power is high under all parameter settings, indicating the validity of the tests for such alternatives. Further, the power increases with increasing sample size, so that even under sparsity, the test shows a high probability of telling the null from the alternative, particularly as the sample size grows.

Table 2 Estimated size of \(T_1\): normal and exponential distributions
Table 3 Estimated power of \(T_1\): normal and exponential distributions

5 Analyses of real data sets

Figure 2 depicts average counts of macrobenthos observed along an approximately 2000 km long transect of the Norwegian continental shelf. The transect under observation comprised a range of water depths and sediment properties. A total of \(p = 809\) species were observed at \(n = 101\) independent sites in five different regions of the transect, where \(n_1 = 16\), \(n_2 = 21\), \(n_3 = 25\), \(n_4 = 19\), \(n_5 = 20\). Each count is a five-replicate pooled observation, and the data contain a large number of zeros where no species could be recorded. For details, see Ellingsen and Gray (2002).

Table 4 Estimated size of \(T_3\) for (CS, AR, UN) structures: all distributions
Table 5 Estimated power of \(T_3\) for (CS, AR, UN) structures: all distributions
Table 6 Estimated power of \(T_1\) under three sparse alternatives

In our notation, \(\mathbf{X} = (\mathbf{X}'_1, \ldots , \mathbf{X}'_5)' \in {\mathbb {R}}^{n\times p}\) represents the complete data matrix with regionwise data matrices \(\mathbf{X}_i = (\mathbf{X}'_{i1}, \ldots , \mathbf{X}'_{in_i})' \in {\mathbb {R}}^{n_i\times p}\), \(\mathbf{X}_{ik} \in {\mathbb {R}}^p\), where \(n_i\) and p are given above. It is thus an unbalanced one-way MANOVA experiment with \(g = 5\) independent samples, each of \(n_i\) iid vectors of dimension 809, where \(n = \sum _{i = 1}^5n_i = 101\). The linear model can be expressed as

$$\begin{aligned} \mathbf{X}_{ik} = {\varvec{\mu }}_i + {\varvec{\epsilon }}_{ik}, ~~k = 1, \ldots , n_i,~ i = 1, \ldots , 5, \end{aligned}$$
(15)

where the vector \(\mathbf{X}_{ik}\) consists of the 809 species counts measured at the \(k\hbox {th}\) replicate (site) from the ith region, \({\varvec{\mu }}_i \in {\mathbb {R}}^p\) is the true average count vector of the \(i\hbox {th}\) region, and the \({\varvec{\epsilon }}_{ik} \in {\mathbb {R}}^p\) are random error vectors, associated with each \(\mathbf{X}_{ik}\), with \(E({\varvec{\epsilon }}_{ik}) = \mathbf{0}\) and \({{\mathrm{Cov}}}({\varvec{\epsilon }}_{ik}) = {\varvec{\Sigma }}_i\) \(\forall ~k\), \(i = 1, \ldots , 5\). The hypothesis of interest can be formulated as \(H_{05}: {\varvec{\mu }}_1 = \cdots = {\varvec{\mu }}_5\) vs \(H_{15}: {\varvec{\mu }}_i \ne {\varvec{\mu }}_j\) for at least one pair \(i \ne j\), \(i, j = 1, \ldots , 5\). We use \(T_g\) in Eq. (8) to test \(H_{05}\).

Fig. 2 Average species counts of the macrobenthos data for the five regions

We also apply the proposed test to two well-known data sets, referred to here as the alcohol and leukemia data. The alcohol data form a two-group (\(g = 2\)) data set that motivated Dempster to construct the first two-sample high-dimensional test (Dempster 1958); see also Dempster (1960, 1968). The data consist of \(p = 59\) biochemistry measurements on \(n_1 = 8\) alcoholic and \(n_2 = 4\) control individuals aged 16–39 years; see also Beerstecher et al. (1950). The three-group (\(g = 3\)) leukemia data are often also used for classification. They consist of measurements on patients with acute lymphoblastic leukemia (ALL) carrying a chromosomal translocation involving the mixed-lineage leukemia (MLL) gene. A total of \(p = 11225\) gene expression profiles of leukemia cells are taken from patients in the ALL group (\(n_1 = 28\)), B-precursor ALL carrying an MLL translocation (\(n_2 = 24\)) and conventional B-precursor ALL without MLL translocation (\(n_3 = 20\)); see Armstrong et al. (2002) for details.

Model (15) remains the same for the alcohol and leukemia data sets, with \(g = 2\) and 3, respectively, and with the corresponding sample sizes given above. The analyses of all three data sets are reported in Table 7. The first three columns report the data sizes, the next three the Chi-square approximation for \(T_g\), and the last two columns the corresponding normal approximation. Only for the alcohol data do the results provide evidence in support of the null hypothesis of no difference of mean vectors, whereas the hypotheses are significantly rejected for both the leukemia and species data. The conclusions for all three data sets are consistent across both approximations. In particular, the results for the species data substantiate what can be roughly witnessed in Fig. 2.

Table 7 Analysis of example data sets

6 Discussion and remarks

Test statistics for high-dimensional mean vectors are presented. A unified strategy is proposed that systematically encompasses the one- and multi-sample cases. The tests are constructed as linear combinations of U-statistics-based estimators and are valid for any distribution with finite fourth moment. The limiting distributions of the tests are derived under a few mild assumptions. Simulations are used to show the accuracy of the tests for moderate sample sizes and any large dimension. The tests are location invariant, so that the mean vectors need not be assumed zero. Due to the singularity of the empirical covariance matrix in the high-dimensional case, an affine-invariant test is not possible, and location invariance is the best that can be achieved in this case.