1 Introduction

Let \(\mathcal{X}_m=\{\mathbf{x}_{1},\ldots ,\mathbf{x}_{m}\}\) and \(\mathcal{Y}_n=\{\mathbf{y}_{1},\ldots ,\mathbf{y}_{n}\}\) be two sets of independent observations from d-dimensional continuous distributions F and G, respectively. In the two-sample problem, we use these observations to test the null hypothesis \(\mathcal{H}_0:F=G\) against the alternative hypothesis \(\mathcal{H}_A:F \ne G\). This problem is well-investigated, and several tests are available for it. Interestingly, many of these tests are based on pairwise Euclidean distances. Under some mild conditions, Maa et al. (1996) showed that for \(\mathbf{X}_1,\mathbf{X}_2 {\mathop {\sim }\limits ^{i.i.d.}}F\) and \(\mathbf{Y}_1,\mathbf{Y}_2 {\mathop {\sim }\limits ^{i.i.d.}}G\), \(\Vert \mathbf{X}_1-\mathbf{X}_2\Vert \), \(\Vert \mathbf{Y}_1-\mathbf{Y}_2\Vert \) and \(\Vert \mathbf{X}_1-\mathbf{Y}_1\Vert \) have the same distribution if and only if F and G are identical. So, pairwise Euclidean distances contain useful information about the difference between two distributions. Also, these distances can be easily computed in any dimension. For these reasons, pairwise Euclidean distances have been extensively used for the construction of two-sample tests, which are applicable to high dimension, low sample size (HDLSS) data.

Existing tests based on pairwise Euclidean distances can be broadly categorized into two groups: (i) tests based on averages of three types (\(\mathbf{X}\mathbf{X}\), \(\mathbf{X}\mathbf{Y}\) and \(\mathbf{Y}\mathbf{Y}\)) of pairwise distances and (ii) tests based on graphs. Tests based on averages of pairwise distances include Baringhaus and Franz (2004, 2010); Székely and Rizzo (2004, 2013); Aslan and Zech (2005); Gretton et al. (2012); Biswas and Ghosh (2014); Sarkar and Ghosh (2018); Tsukada (2019).

Almost all graph-based tests consider an edge-weighted complete graph \(\mathcal{G}\) on the vertex set \(\mathcal{Z}_N=\mathcal{X}_m\cup \mathcal{Y}_n\) (here \(N=m+n\) is the total sample size), where the Euclidean distance between two vertices is taken to be the weight associated with the edge connecting them. Different tests consider different sub-graphs of \(\mathcal{G}\) and look at their topologies. The deviation of the topology of a sub-graph from the one expected under \(\mathcal{H}_0\) is used to construct the test statistic. Friedman and Rafsky (1979) used the minimum spanning tree (MST) of \(\mathcal{G}\) as the sub-graph to construct the Kolmogorov–Smirnov test and the Wald-Wolfowitz run test for multivariate data. Biswas et al. (2014) used the shortest Hamiltonian path (SHP) on \(\mathcal{G}\) to develop another multivariate run test. Rosenbaum (2005) constructed the cross-match test using \(\lfloor N/2 \rfloor \) disconnected edges of \(\mathcal{G}\) (here \(\lfloor t \rfloor \) is the largest integer \(\le \)t) for which the total edge weight is minimum. Liu and Modarres (2011) considered all cliques of size 3 to construct their test statistic. The tests based on nearest-neighbor type coincidences (see, e.g., Schilling 1986; Henze 1988; Mondal et al. 2015) can be viewed as tests based on directed sub-graphs of \(\mathcal{G}\), whereas the test by Hall and Tajvidi (2002) can be viewed as an aggregation of tests based on sub-graphs for different levels of neighborhood. Recently, Chen and Friedman (2017) also constructed some two-sample tests using graph-theoretic ideas.

In this article, we will mainly consider four tests, namely, the test based on nearest-neighbors (Schilling 1986; Henze 1988), the multivariate run test based on MST (Friedman and Rafsky 1979), the multivariate run test based on SHP (Biswas et al. 2014) and the cross-match test based on optimal non-bipartite matching (Rosenbaum 2005) for our investigation. We will refer to these four tests as the NN test, the MST-run test, the SHP-run test and the NBP test, respectively. Brief descriptions of these tests are given below.

NN test (Schilling 1986; Henze 1988): Consider the edge-weighted complete graph \(\mathcal{G}\) on the vertex set \(\mathcal{Z}_N\), as discussed above. Assume that an undirected edge \((\mathbf{u},\mathbf{v})\) in \(\mathcal{G}\) corresponds to two directed edges \((\overrightarrow{\mathbf{u},\mathbf{v}})\) and \((\overrightarrow{\mathbf{v}, \mathbf{u}})\). For a fixed \(k < N\), consider the sub-graph \(\mathcal{T}_{k}\), which contains an edge \((\overrightarrow{\mathbf{u},\mathbf{v}})\) if and only if \(\mathbf{v}\) is among the first k nearest-neighbors (in terms of the Euclidean distance) of \(\mathbf{u}\). So, \(\mathcal{T}_k\) contains Nk directed edges. The NN test uses the test statistic \(T_{NN} = \frac{1}{Nk} \sum _{(\overrightarrow{\mathbf{u},\mathbf{v}}) \in \mathcal{T}_k} {\mathbb {I}}(\mathbf{u},\mathbf{v})\), where \({\mathbb {I}}(\mathbf{u},\mathbf{v})\) is an indicator variable that takes the value 1 if \(\mathbf{u}\) and \(\mathbf{v}\) are from the same distribution. It rejects \(\mathcal{H}_0\) for large values of \(T_{NN}\). A more familiar expression of this test statistic is \(T_{NN} = \frac{1}{Nk} \big [\sum _{i=1}^m \sum _{r=1}^k {\mathbb {I}}_r(\mathbf{x}_i) +\sum _{i=1}^n \sum _{r=1}^k {\mathbb {I}}_r(\mathbf{y}_i)\big ]\), where \({\mathbb {I}}_r(\mathbf{z})\) is an indicator variable that takes the value 1 if \(\mathbf{z}\) and its r-th nearest-neighbor (in terms of the Euclidean distance) come from the same distribution.
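For illustration, a minimal Python sketch of this statistic is given below (not the code used for our numerical work; the function name t_nn and the 1/0 label coding are ours). For continuous data, ties among the distances occur with probability zero, so the arbitrary tie-breaking of the sort is harmless.

```python
import numpy as np

def t_nn(Z, labels, k=3):
    """T_NN: fraction of the N*k directed k-NN edges joining points with
    the same label (1 = X-sample, 0 = Y-sample); large values reject H0."""
    N = Z.shape[0]
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(D, np.inf)             # a point is not its own neighbor
    nbrs = np.argsort(D, axis=1)[:, :k]     # indices of the k nearest neighbors
    same = labels[nbrs] == labels[:, None]  # indicator I(u, v) for each directed edge
    return same.mean()                      # = (1 / Nk) * sum of the indicators
```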

MST-run test (Friedman and Rafsky 1979): Unlike the NN test, this test is based on an undirected sub-graph of \(\mathcal{G}\). Let \(\mathcal{M}\) be the minimum spanning tree (MST) of \(\mathcal{G}\). The MST-run test uses the test statistic \(T_{MST} = 1 + \sum _{i=1}^{N-1} \lambda ^\mathcal{M}_i\), where \(\lambda ^\mathcal{M}_i\) is an indicator variable that takes the value 1 if and only if the i-th edge \((i=1,\ldots ,N-1)\) of \(\mathcal{M}\) connects two observations from different distributions. The null hypothesis \(\mathcal{H}_0\) is rejected for small values of \(T_{MST}\).
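A corresponding sketch of \(T_{MST}\), built on SciPy's minimum spanning tree routine (again illustrative; SciPy treats exact zero distances as missing edges, which is harmless for continuous data):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def t_mst(Z, labels):
    """T_MST: 1 + number of MST edges joining differently labeled points;
    small values reject H0."""
    D = squareform(pdist(Z))                 # dense Euclidean distance matrix
    mst = minimum_spanning_tree(D).tocoo()   # the N-1 edges of the MST
    cross = labels[mst.row] != labels[mst.col]
    return 1 + int(cross.sum())
```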

SHP-run test (Biswas et al. 2014): Instead of MST, it uses the shortest Hamiltonian path (SHP). Let \(\mathcal{S}\) be the SHP on \(\mathcal{G}\). The number of runs along \(\mathcal{S}\) is computed as \(T_{SHP} = 1 + \sum _{i=1}^{N-1} \lambda ^\mathcal{S}_i\), where the indicator \(\lambda ^\mathcal{S}_i\) takes the value 1 if and only if the i-th edge of \(\mathcal{S}\) connects two observations from different distributions. The SHP-run test rejects \(\mathcal{H}_0\) for small values of \(T_{SHP}\).
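Finding the exact SHP is NP-hard, so specialized TSP-type solvers or heuristics are used in practice. The sketch below substitutes a simple greedy nearest-neighbor heuristic (our simplification, not the construction used by Biswas et al. 2014) just to make the run count concrete:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def greedy_path(D):
    """Greedy nearest-neighbor heuristic for a short Hamiltonian path
    (a cheap surrogate; computing the exact SHP is NP-hard)."""
    N, path, visited = D.shape[0], [0], {0}
    while len(path) < N:
        order = np.argsort(D[path[-1]])
        nxt = int(next(j for j in order if j not in visited))
        path.append(nxt)
        visited.add(nxt)
    return path

def t_shp(Z, labels):
    """T_SHP: 1 + number of path edges joining differently labeled points."""
    D = squareform(pdist(Z))
    p = greedy_path(D)
    return 1 + int(sum(labels[p[i]] != labels[p[i + 1]] for i in range(len(p) - 1)))
```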

NBP test (Rosenbaum 2005): It uses the optimal non-bipartite matching algorithm (see, e.g., Lu et al. 2011) to find \(\lfloor N/2\rfloor \) disconnected edges (i.e., no two edges share a common vertex) in \(\mathcal{G}\) such that the total weight of the edges is minimum. Let \(\mathcal{C}=\{(\mathbf{u}_{i},\mathbf{v}_{i}): ~i=1,\ldots ,\lfloor N/2 \rfloor \}\) be the collection of these edges. The NBP test rejects \(\mathcal{H}_0\) for small values of the test statistic \(T_{NBP} = \sum _{i=1}^{\lfloor N/2 \rfloor } \lambda ^\mathcal{C}_i\), where \(\lambda ^\mathcal{C}_i\) is an indicator variable that takes the value 1 if and only if \(\mathbf{u}_{i}\) and \(\mathbf{v}_{i}\) are from two different distributions.
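A sketch of \(T_{NBP}\) using NetworkX (version 3 or later is assumed, where nx.min_weight_matching returns a minimum-weight matching of maximum cardinality; for odd N one observation is left unmatched, in line with the \(\lfloor N/2 \rfloor \)-edge description above):

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

def t_nbp(Z, labels):
    """T_NBP: number of cross-sample pairs among the floor(N/2) pairs of a
    minimum-weight non-bipartite matching; small values reject H0."""
    D = squareform(pdist(Z))
    N = Z.shape[0]
    G = nx.Graph()
    for i in range(N):
        for j in range(i + 1, N):
            G.add_edge(i, j, weight=D[i, j])
    match = nx.min_weight_matching(G)   # floor(N/2) disjoint minimum-weight pairs
    return int(sum(labels[i] != labels[j] for i, j in match))
```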

These tests based on pairwise Euclidean distances are known to work well in many high-dimensional problems even when the dimension is much larger than the sample size. But they also have some limitations. To demonstrate this, we consider the following examples.

Example 1

F and G are Gaussian with the same mean \(\mathbf{0}_d = (0,\ldots ,0)^{\top }\) and diagonal dispersion matrices \({\varvec{\Lambda }}_{1,d}\) and \( {\varvec{\Lambda }}_{2,d}\), respectively. The first \(\lfloor d/2 \rfloor \) diagonal elements of \({\varvec{\Lambda }}_{1,d}\) are 1 and the rest are 2, whereas for \({\varvec{\Lambda }}_{2,d}\), the first \(\lfloor d/2 \rfloor \) diagonal elements are 2 and the rest are 1.

Example 2

For \(\mathbf{X}=(X^{(1)},\ldots ,X^{(d)})^{\top } \sim F\), \(X^{(1)},\ldots ,X^{(d)}\) are i.i.d. \(\mathcal{N}(0,5)\), while for \(\mathbf{Y}=(Y^{(1)},\ldots ,Y^{(d)})^{\top } \sim G\), \(Y^{(1)},\ldots ,Y^{(d)}\) are i.i.d. \(t_{5}(0,3)\). Here \(\mathcal{N}(\mu ,\sigma ^2)\) denotes the normal distribution with mean \(\mu \) and variance \(\sigma ^2\), and \(t_{\nu }(\mu ,\sigma ^2)\) denotes the Student’s t-distribution with \(\nu \) degrees of freedom, location \(\mu \) and scale \(\sigma \).

For each of these examples, we carried out our experiment with \(d=2^{i}\) for \(i=1,\ldots ,10\). For different values of d, we generated 20 observations from each distribution and used them to test \(\mathcal{H}_0:F=G\) against \(\mathcal{H}_A:F \ne G\). We repeated each experiment 500 times and estimated the power of a test by the proportion of times it rejected \(\mathcal{H}_0\). These estimated powers are reported in Fig. 1. The SHP-run test and the NBP test are distribution-free. For the other two tests, we used conditional tests based on 1000 random permutations. For the NN test, we used \(k=3\) for all numerical work since it has been reported to perform well in the literature (see, e.g., Schilling 1986). Throughout this article, all tests are considered to have 5% nominal level.
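For readers who wish to reproduce such experiments, the conditional test used for the NN and MST-run statistics can be sketched in a few lines of Python (illustrative only, not the MATLAB/R code used for our numerical work; t_nn refers to the sketch given in the description of the NN test):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_pvalue(X, Y, statistic, B=1000, larger_rejects=True):
    """Conditional (permutation) p-value for any of the statistics above."""
    Z = np.vstack([X, Y])
    labels = np.array([1] * len(X) + [0] * len(Y))
    obs = statistic(Z, labels)
    null = np.array([statistic(Z, rng.permutation(labels)) for _ in range(B)])
    tail = null >= obs if larger_rejects else null <= obs  # T_NN vs run-type tests
    return (1 + tail.sum()) / (B + 1)

# Example 1 with d = 64 and m = n = 20 (illustrative):
d, m = 64, 20
lam1 = np.r_[np.ones(d // 2), 2 * np.ones(d - d // 2)]   # diagonal of Lambda_{1,d}
X = rng.normal(scale=np.sqrt(lam1), size=(m, d))
Y = rng.normal(scale=np.sqrt(lam1[::-1]), size=(m, d))   # Lambda_{2,d} reverses it
# p = permutation_pvalue(X, Y, t_nn); reject H0 at the 5% level if p < 0.05
```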

Fig. 1

Powers of NN, MST-run, SHP-run and NBP tests in Examples 1 and 2

Figure 1 clearly shows that all these tests had poor performance in Examples 1 and 2. Note that in both of these examples, each measurement variable has different distributions under F and G. So, each of them carries a signal against \(\mathcal{H}_0\). Therefore, the power of a reasonable test is expected to increase to one as the dimension increases. But we did not observe that for these tests based on pairwise Euclidean distances.

The reason behind the poor performance by these tests can be explained using the distance concentration phenomenon in high dimensions. To understand it, let us consider four independent random vectors \(\mathbf{X}_1,\mathbf{X}_2 \sim F\) and \(\mathbf{Y}_1,\mathbf{Y}_2 \sim G\), where F and G have mean vectors \({\varvec{\mu }}_F,{\varvec{\mu }}_G\) and dispersion matrices \({\varvec{\Sigma }}_F,{\varvec{\Sigma }}_G\), respectively. We also make the following assumptions.

Assumption 1

For \(\mathbf{X}\sim F\) and \(\mathbf{Y}\sim G\), fourth moments of the component variables of \(\mathbf{X}\) and \(\mathbf{Y}\) are uniformly bounded.

Assumption 2

For \(\mathbf{W}=\mathbf{X}_{1}-\mathbf{X}_{2}\), \(\mathbf{Y}_{1}-\mathbf{Y}_{2}\) and \(\mathbf{X}_{1}-\mathbf{Y}_{1}\), \(\sum _{1 \le q \ne q^\prime \le d} \text {corr}\big \{(W^{(q)})^{2},(W^{(q^\prime )})^{2}\big \}\) is of the order \(\mathbf{o}(d^{2})\).

Assumption 3

There exist non-negative constants \(\nu ^2\), \(\sigma _{F}^{2}\) and \(\sigma _{G}^{2}\) such that \(d^{-1}\Vert {\varvec{\mu }}_F-{\varvec{\mu }}_G\Vert ^{2}\rightarrow \nu ^{2}\), \(d^{-1} trace({\varvec{\Sigma }}_F) \rightarrow \sigma _{F}^{2}\) and \(d^{-1} trace({\varvec{\Sigma }}_G) \rightarrow \sigma _{G}^{2}\) as \(d \rightarrow \infty \).

These assumptions are quite common in the HDLSS asymptotic literature (see, e.g., Hall et al. 2005; Jung and Marron 2009; Biswas et al. 2014; Dutta et al. 2016). Under Assumptions 1 and 2, the weak law of large numbers (see, e.g., Billingsley 1995) holds for the sequence of possibly dependent and non-identically distributed random variables \(\{(W^{(q)})^2: q \ge 1\}\), i.e., \(|d^{-1}\sum _{q=1}^{d} (W^{(q)})^2-d^{-1}\sum _{q=1}^{d} E((W^{(q)})^2)| {\mathop {\rightarrow }\limits ^{Pr}}0\) as \(d \rightarrow \infty \). Assumption 3 gives the limiting value of \(d^{-1}\sum _{q=1}^{d} E((W^{(q)})^2)\). These three assumptions hold in Examples 1 and 2, and under these assumptions, we have the following result on the high-dimensional behavior of pairwise Euclidean distances. The proof follows from the discussions in Sections 3.1 and 3.2 of Hall et al. (2005).

Lemma 1

(Hall et al. 2005) Suppose that \(\mathbf{X}_1,\mathbf{X}_2 \sim F\) and \(\mathbf{Y}_1,\mathbf{Y}_2 \sim G\) are independent random vectors. If F and G satisfy Assumptions 1–3, then \(d^{-1/2} \Vert \mathbf{X}_1-\mathbf{X}_2\Vert {\mathop {\rightarrow }\limits ^{\text {Pr}}}\sigma _F \sqrt{2}\), \(d^{-1/2} \Vert \mathbf{Y}_1-\mathbf{Y}_2\Vert {\mathop {\rightarrow }\limits ^{\text {Pr}}}\sigma _G \sqrt{2}\) and \(d^{-1/2} \Vert \mathbf{X}_1-\mathbf{Y}_1\Vert {\mathop {\rightarrow }\limits ^{\text {Pr}}}\sqrt{\sigma _F^2 + \sigma _G^2 + \nu ^2}\) as d tends to infinity.

Under Assumptions 1–3, using Lemma 1, Biswas et al. (2014) proved the high-dimensional consistency (i.e., the convergence of power to 1 as d tends to infinity) of the SHP-run test when \(\nu ^2 > 0\) or \(\sigma _F^2 \ne \sigma _G^2\). Under the same condition, one can show this consistency for the NBP test as well (follows using arguments similar to those used in the proof of part (b) of Theorem 1). When \(\nu ^2 > |\sigma _F^2-\sigma _G^2|\), such high-dimensional consistency can also be proved for the NN test (follows using arguments similar to those used in the proof of part (a) of Theorem 2) and the MST-run test (see Biswas et al. 2014). However, in Examples 1 and 2, we have \(\nu ^2=0\) and \(\sigma _F^2=\sigma _G^2\). So, \(d^{-1/2}\Vert \mathbf{X}_1-\mathbf{X}_2\Vert \), \(d^{-1/2}\Vert \mathbf{Y}_1-\mathbf{Y}_2\Vert \) and \(d^{-1/2}\Vert \mathbf{X}_1-\mathbf{Y}_1\Vert \) all converge to the same value (follows from Lemma 1). Therefore, tests based on pairwise Euclidean distances fail to capture the difference between the two underlying distributions and have poor performance.
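This concentration is easy to see numerically. The following small Monte Carlo check (illustrative Python, ours) uses Example 1, where \(\sigma _F^2=\sigma _G^2=1.5\) and \(\nu ^2=0\), so all three scaled distances approach \(\sqrt{3} \approx 1.73\):

```python
import numpy as np

rng = np.random.default_rng(1)
for d in (10, 100, 1000, 10000):
    lam1 = np.r_[np.ones(d // 2), 2 * np.ones(d - d // 2)]
    X1, X2 = rng.normal(scale=np.sqrt(lam1), size=(2, d))   # two draws from F
    Y1 = rng.normal(scale=np.sqrt(lam1[::-1]), size=d)      # one draw from G
    print(d, np.linalg.norm(X1 - X2) / np.sqrt(d),
          np.linalg.norm(X1 - Y1) / np.sqrt(d))
# both columns settle near sqrt(3) ~ 1.732, so the XX and XY distances
# become indistinguishable, exactly as Lemma 1 predicts
```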

Fig. 2

Powers of different graph-based tests and their modified versions based on \(\varphi _{h,\psi }\) in Example 1. \(T^{\mathrm{lin}}\) & \(T^{\exp }\) correspond to tests based on \(\psi _1(t)=t\) and \(\psi _2(t)=1-\exp (-t)\), respectively

Fig. 3

Powers of different graph-based tests and their modified versions based on \(\varphi _{h,\psi }\) in Example 2. \(T^{\mathrm{lin}}\) & \(T^{\exp }\) correspond to tests based on \(\psi _1(t)=t\) and \(\psi _2(t)=1-\exp (-t)\), respectively

Recently, Sarkar and Ghosh (2018) identified this problem for tests based on averages of pairwise Euclidean distances. To cope with such situations, instead of the Euclidean distance, they suggested to use distance functions of the form \(\varphi _{h,\psi }(\mathbf{u},\mathbf{v})= h\{d^{-1}\sum _{q=1}^{d} \psi (|u^{(q)}-v^{(q)}|)\}\) for suitably chosen strictly increasing functions \(h,\psi :[0,\infty )\rightarrow [0, \infty )\) with \(h(0)=\psi (0)=0\). Now, one may be interested to know what happens to the graph-based tests if they are constructed using \(\varphi _{h, \psi }\) (i.e., the edge-weights in \(\mathcal{G}\) are defined using \(\varphi _{h,\psi }\)). For this investigation, here we consider two choices of \(\psi \), namely, \(\psi _1(t)=t\) and \(\psi _2(t)=1-\exp (-t)\), with \(h(t)=t\) in both cases. Figures 2 and 3 show powers of these modified tests in Examples 1 and 2 (curves corresponding to \({T}^{\mathrm{lin}}\) and \({T}^{\exp }\) represent the tests based on \(\psi _1\) and \(\psi _2\), respectively). These tests had excellent performance in Example 1. Their powers converged to one as the dimension increased. Modified versions of SHP-run test had similar behavior in Example 2 as well. In this example, powers of modified NBP tests also increased with the dimension, but those of modified NN and MST-run tests dropped down to zero as the dimension increased. In the next section, we investigate the reasons behind such contrasting behavior of these tests based on \(\varphi _{h, \psi }\).
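These distance functions are straightforward to compute in any dimension. A small Python sketch (ours, illustrative) covering \(\psi _1\), \(\psi _2\) and the Euclidean special case:

```python
import numpy as np

def phi(u, v, h=lambda t: t, psi=lambda t: t):
    """phi_{h,psi}(u, v) = h( d^{-1} sum_q psi(|u_q - v_q|) )."""
    return h(np.mean(psi(np.abs(u - v))))

# the two choices used in Figs. 2 and 3 (h(t) = t in both cases)
phi_lin = lambda u, v: phi(u, v, psi=lambda t: t)                # psi_1(t) = t
phi_exp = lambda u, v: phi(u, v, psi=lambda t: 1 - np.exp(-t))   # psi_2(t) = 1 - exp(-t)
# the (scaled) Euclidean distance as a special case: h(t) = sqrt(t), psi(t) = t^2
phi_euc = lambda u, v: phi(u, v, h=np.sqrt, psi=np.square)
```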

2 High-dimensional behavior of the tests based on \(\varphi _{h,\psi }\)

In this section, we carry out a theoretical investigation on the high-dimensional behavior of the modified tests based on \(\varphi _{h,\psi }\). For this investigation, we make the following assumption.

Assumption 4

Let \(\mathbf{X}_1,\mathbf{X}_2 \sim F\), \(\mathbf{Y}_1,\mathbf{Y}_2 \sim G\) be independent random vectors. The function \(\psi :[0,\infty ) \rightarrow [0,\infty )\) satisfies \(\psi (0)=0\), and for \(\mathbf{W}=\mathbf{X}_1-\mathbf{X}_2\), \(\mathbf{Y}_1-\mathbf{Y}_2\) and \(\mathbf{X}_1-\mathbf{Y}_1\), \(d^{-1} \sum _{q=1}^d \big \{\psi (|W^{(q)}|) - \text {E}\psi (|W^{(q)}|)\big \} \overset{\text {Pr}}{\rightarrow } 0\) as \(d \rightarrow \infty \).

Recall that Assumptions 1 and 2 lead to the probability convergence of \(d^{-1}\sum _{q=1}^{d}\{(W^{(q)})^2-E((W^{(q)})^2)\}\). So, Assumption 4 can be viewed as a generalization of Assumptions 1 and 2 with \((W^{(q)})^2\) replaced by \(\psi (|W^{(q)}|)\). Note that Assumption 1 holds if and only if the second moments of \(\{(W^{(q)})^2:~q\ge 1\}\) are uniformly bounded. So, if the random variables \(\{\psi (|W^{(q)}|) : q \ge 1\}\) have uniformly bounded second moments (which holds trivially if \(\psi \) is bounded), and they are weakly dependent in the sense that \(\sum _{1\le q \ne q^\prime \le d}\text {corr}\big \{\psi (|W^{(q)}|),\psi (|W^{(q^\prime )}|)\big \}=\mathbf{o}(d^{2})\), then Assumption 4 holds. In particular, it holds if the sequence \(\{\psi (|W^{(q)}|) : q \ge 1\}\) is m-dependent (see, e.g., Billingsley 1995, page 90) or \(\rho \)-mixing (see, e.g., Lin and Lu 1996, page 4). However, Assumption 4 holds in many other situations as well. For instance, sufficient conditions for mixingale sequences of random variables have been derived by Andrews (1988) and de Jong (1995). Dutta et al. (2016, Section 7) also made some assumptions for deriving the weak law for the sequence \(\{(W^{(q)})^2:~q\ge 1\}\). Similar assumptions with appropriate modifications for \(\psi \) also lead to Assumption 4.

If h is uniformly continuous (which we will assume throughout this article, unless otherwise mentioned), under Assumption 4, \(\big \{\varphi _{h,\psi }(\mathbf{X}_1,\mathbf{X}_2) - \varphi _{h,\psi }^*(F,F)\big \}\), \(\big \{\varphi _{h,\psi }(\mathbf{Y}_1,\mathbf{Y}_2) - \varphi _{h,\psi }^*(G,G)\big \}\) and \(\big \{\varphi _{h,\psi }(\mathbf{X}_1,\mathbf{Y}_1) - \varphi _{h,\psi }^*(F,G)\big \}\) converge in probability to 0 as d tends to infinity, where

$$\begin{aligned}&\varphi ^*_{h,\psi }(F,F) = h\Big \{{d}^{-1} \sum _{q=1}^d \text {E}\psi (|X_1^{(q)}-X_2^{(q)}|)\Big \},\nonumber \\&\varphi ^*_{h,\psi }(G,G) = h\Big \{{d}^{-1}\sum _{q=1}^d \text {E}\psi (|Y_1^{(q)}-Y_2^{(q)}|)\Big \} \quad \text {and }\nonumber \\&\varphi ^*_{h,\psi }(F,G) = h\Big \{{d}^{-1}\sum _{q=1}^d \text {E}\psi (|X_1^{(q)}-Y_1^{(q)}|)\Big \}. \end{aligned}$$
(1)

An interesting lemma involving these three quantities is stated below.

Lemma 2

(Sarkar and Ghosh 2018, Lemma 1) Suppose that h is a strictly increasing, concave function and \(\psi ^\prime (t)/t\) is a non-constant, monotone function. Then, for any fixed \(d\ge 1\), \(e_{h,\psi }(F,G) = 2 \varphi ^*_{h,\psi }(F,G) - \varphi ^*_{h,\psi }(F,F) - \varphi ^*_{h,\psi }(G,G) \ge 0\), where the equality holds if and only if F and G have the same univariate marginal distributions.

The quantity \(e_{h,\psi }(F,G)\) can be viewed as an energy distance between the distributions F and G (see, e.g., Székely and Rizzo 2004; Aslan and Zech 2005), and it serves as a measure of separation between the two distributions. Lemma 2 shows that for every \(d\ge 1\), \(e_{h,\psi }(F,G)\) is positive unless F and G have the same univariate marginals. For deriving high-dimensional results, we assume that \({\widetilde{e}}_{h,\psi }(F,G) = \liminf _{d \rightarrow \infty } e_{h,\psi }(F,G)>0\), which ensures that the energy distance between the two populations is asymptotically non-negligible. The following theorem shows the high-dimensional consistency of SHP-run and NBP tests based on \(\varphi _{h,\psi }\) under this assumption.

Theorem 1

Let \(\mathbf{X}_1,\ldots ,\mathbf{X}_m\sim F\) and \(\mathbf{Y}_1,\ldots ,\mathbf{Y}_n\sim G\)\((\text{ where }~m+n=N)\) be independent random vectors, where F and G satisfy Assumption 4 with \({\widetilde{e}}_{h,\psi }(F,G) = \liminf _{d \rightarrow \infty } e_{h,\psi }(F,G) > 0\).

  1. (a)

    If \(N/\left( {\begin{array}{c}N\\ m\end{array}}\right) < \alpha \), then the power of the SHP-run test (of level \(\alpha \)) based on \(\varphi _{h,\psi }\) converges to 1 as d tends to infinity.

  2. (b)

    If \(c(m,n) < \alpha \), then the power of the NBP test (of level \(\alpha \)) based on \(\varphi _{h,\psi }\) converges to 1 as d tends to infinity. Here, \(c(m,n)\) is given by

$$\begin{aligned} c(m,n) = {\left\{ \begin{array}{ll} \frac{\left( N/2\right) !}{\left( {\begin{array}{c}N\\ m\end{array}}\right) \left( m/2\right) ! \left( n/2\right) !}, &{}\text { if both } m,n \text { are even}\\ \frac{2\left( N/2\right) !}{\left( {\begin{array}{c}N\\ m\end{array}}\right) \left( (m-1)/2\right) ! \left( (n-1)/2\right) !}, &{}\text { if both } m,n \text { are odd}\\ \frac{\left( (N-1)/2\right) !}{\left( {\begin{array}{c}N-1\\ m\end{array}}\right) \left( m/2\right) ! \left( (n-1)/2\right) !}, &{}\text { if } m \text { is even and }n \text { is odd}\\ \frac{\left( (N-1)/2\right) !}{\left( {\begin{array}{c}N-1\\ m-1\end{array}}\right) \left( (m-1)/2\right) ! \left( n/2\right) !},&\text { if } m \text { is odd and }n \text { is even} \end{array}\right. } \end{aligned}$$

Theorem 1 shows that if \({\widetilde{e}}_{h,\psi }(F,G)>0\), then SHP-run and NBP tests based on \(\varphi _{h, \psi }\) are consistent in high dimensions provided the sample sizes are not too small. In view of Lemma 2, for our two choices of h and \(\psi \), we have \({{\widetilde{e}}}_{h,\psi }(F,G)>0\) in Examples 1 and 2. This was the reason behind the excellent performance of these tests. However, for the tests based on the Euclidean distance (i.e., when \(h(t)=\sqrt{t}\) and \(\psi (t)=t^2)\), we have \({\widetilde{e}}_{h,\psi }(F,G)=2\sqrt{\sigma _F^2+\sigma _G^2+\nu ^2}-\sigma _F\sqrt{2}-\sigma _G\sqrt{2}\) (follows from Lemma 1), which is positive if and only if \(\nu ^2 > 0\) or \(\sigma _F^2 \ne \sigma _G^2\). This was not the case in Examples 1 and 2, where all tests based on the Euclidean distance had poor performance.

For the high-dimensional consistency of NN and MST-run tests based on \(\varphi _{h,\psi }\), we need some additional conditions, as shown by the following theorem.

Theorem 2

Let \(\mathbf{X}_1,\ldots ,\mathbf{X}_m \sim F\) and \(\mathbf{Y}_1,\ldots ,\mathbf{Y}_n\sim G\)\((\text{ where }~m+n=N)\) be independent random vectors, where F and G satisfy Assumption 4. Also assume that both \(\liminf _{d \rightarrow \infty } \{\varphi _{h,\psi }^{*}(F,G) - \varphi _{h,\psi }^*(F,F)\}\) and \(\liminf _{d \rightarrow \infty } \{\varphi _{h,\psi }^*(F,G) - \varphi _{h,\psi }^*(G,G)\}\) are positive.

  1. (a)

    Define \(N_0 = \lceil N/(k+1) \rceil \) and \(m_0 = \lceil \min \{m,n\}/(k+1) \rceil \) (here \(\lceil t \rceil \) denotes the smallest integer larger than or equal to t). If \(k < \min \{m,n\}\) and \(\left( {\begin{array}{c}N_0\\ m_0\end{array}}\right) < \alpha \left( {\begin{array}{c}N\\ m\end{array}}\right) \), then the power of the NN test (of level \(\alpha \)) based on \(\varphi _{h,\psi }\) converges to 1 as d tends to infinity.

  2. (b)

    If \(\max \{\lfloor N/m \rfloor , \lfloor N/n \rfloor \} < \alpha \left( {\begin{array}{c}N\\ m\end{array}}\right) \), then the power of the MST-run test (of level \(\alpha \)) based on \(\varphi _{h,\psi }\) converges to 1 as d tends to infinity.

In Theorems 1 and 2, for the high-dimensional consistency of different two-sample tests, we need some conditions on sample sizes. These conditions are satisfied if m and n are not too small. For instance, for \(\alpha = 0.05\), we have the consistency for NN and MST-run tests when \(m,n \ge 4\). For the SHP-run test and the NBP test, we need \(m,n \ge 5\) and \(m,n \ge 8\), respectively. For smaller values of \(\alpha \), we need larger sample sizes. To better understand the sample size requirements, for different choices of m and n, we have computed the minimum value of \(\alpha \) for which the resulting test turns out to be consistent. These minimum values are plotted in Fig. 4. It shows that for NN, MST-run and SHP-run tests, the minimum value of \(\alpha \) drops monotonically as the sample sizes increase. For the NBP test, we see two different types of results corresponding to even and odd sample sizes. However, for \(m,n \ge 10\), the minimum value of \(\alpha \) is very close to zero for all of these tests.

Fig. 4

Minimum size of a consistent test for different sample sizes
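The thresholds plotted in Fig. 4 are elementary to compute. The Python sketch below (ours, illustrative) evaluates \(c(m,n)\) from Theorem 1 and the minimum level \(\alpha \) for which each of the four consistency conditions (Theorems 1 and 2) holds:

```python
from math import comb, factorial

def c_mn(m, n):
    """c(m, n) of Theorem 1(b)."""
    N = m + n
    if m % 2 == 0 and n % 2 == 0:
        return factorial(N // 2) / (comb(N, m) * factorial(m // 2) * factorial(n // 2))
    if m % 2 == 1 and n % 2 == 1:
        return 2 * factorial(N // 2) / (comb(N, m) * factorial((m - 1) // 2) * factorial((n - 1) // 2))
    if m % 2 == 0:  # m even, n odd
        return factorial((N - 1) // 2) / (comb(N - 1, m) * factorial(m // 2) * factorial((n - 1) // 2))
    return factorial((N - 1) // 2) / (comb(N - 1, m - 1) * factorial((m - 1) // 2) * factorial(n // 2))

def min_alpha(m, n, k=3):
    """Smallest level alpha at which each consistency condition is met."""
    N = m + n
    N0 = -(-N // (k + 1))             # ceil(N / (k + 1))
    m0 = -(-min(m, n) // (k + 1))     # ceil(min(m, n) / (k + 1))
    return {"NN": comb(N0, m0) / comb(N, m),
            "MST": max(N // m, N // n) / comb(N, m),
            "SHP": N / comb(N, m),
            "NBP": c_mn(m, n)}

print(min_alpha(20, 20))   # all four thresholds are far below 0.05
```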

The conditions \(\liminf _{d \rightarrow \infty } \{\varphi _{h,\psi }^{*}(F,G) - \varphi _{h,\psi }^*(F,F)\} > 0\) and \(\liminf _{d \rightarrow \infty } \{\varphi _{h,\psi }^*(F,G) - \varphi _{h,\psi }^*(G,G)\} > 0\) assumed in Theorem 2 ensure that the neighborhood structure (in terms of \(\varphi _{h,\psi }\)) is preserved in high dimensions, i.e., an observation and its nearest neighbor (in terms of \(\varphi _{h,\psi }\)) come from the same distribution with high probability. For the Euclidean distance (i.e., for \(h(t) = \sqrt{t}\) and \(\psi (t) = t^2\)), these conditions are equivalent to \(\nu ^2 > |\sigma _F^2 - \sigma _G^2|\) (follows from Lemma 1), i.e., the location-difference between two distributions dominates the overall scale-difference. In Example 1, from the descriptions of F and G, it is easy to check that \(\lim _{d \rightarrow \infty } \varphi _{h,\psi }^*(F,F) = \lim _{d \rightarrow \infty } \varphi ^*_{h,\psi }(G,G)\). So, in view of Lemma 2, we have \(\lim _{d \rightarrow \infty } \{\varphi _{h,\psi }^{*}(F,G) - \varphi _{h,\psi }^*(F,F)\}>0\) and \(\lim _{d \rightarrow \infty } \{\varphi _{h,\psi }^*(F,G) - \varphi _{h,\psi }^*(G,G)\}>0\). But in Example 2, one can show that \(\varphi _{h,\psi }^*(F,G)\) lies between \(\varphi _{h,\psi }^*(F,F)\) and \(\varphi _{h,\psi }^*(G,G)\). This violation of neighborhood structure led to poor performance by NN and MST-run tests based on \(\varphi _{h,\psi }\). The following theorem shows that in such situations, powers of these tests may even drop down to zero.

Theorem 3

Let \(\mathbf{X}_1,\ldots ,\mathbf{X}_m\sim F\) and \(\mathbf{Y}_1,\ldots ,\mathbf{Y}_n\sim G\) be independent random vectors, where F and G satisfy Assumption 4. Also assume that \(\limsup _{d \rightarrow \infty } \{\varphi _{h,\psi }^*(F,G) - \varphi _{h,\psi }^*(F,F)\} < 0\) (interchange F and G if required, and in that case, interchange m and n accordingly).

  1. (a)

    If \(k < \min \{m,n\}\) and \((m-1)/n > (1+\alpha )/(1-\alpha )\), then the power of the NN test (of level \(\alpha \)) based on \(\varphi _{h,\psi }\) converges to 0 as d tends to infinity.

  2. (b)

    If \(m/n > (1+\alpha )/(1-\alpha )\), then the power of the MST-run test (of level \(\alpha \)) based on \(\varphi _{h,\psi }\) converges to 0 as d tends to infinity.

The conditions involving m and n in Theorem 3 are only sufficient. Note that these conditions do not hold in Example 2, but NN and MST-run tests based on \(\varphi _{h,\psi }\) had powers close to 0. To overcome the limitations of NN and MST-run tests, in the next section, we introduce a new class of dissimilarity measures called MADD and use it for further modification of the tests.

3 Further modification of tests based on MADD

From the discussion in the previous section, it is clear that in order to have good performance by NN and MST-run tests in high dimensions, we need to use a distance function or dissimilarity measure which preserves the neighborhood structure. To achieve this, we define the dissimilarity between two observations \(\mathbf{x}\) and \(\mathbf{y}\) in \(\mathcal{Z}_N\) as

$$\begin{aligned} \rho _{h,\psi }(\mathbf{x},\mathbf{y}) = \frac{1}{N-2} \sum _{\mathbf{z}\in \mathcal{Z}_N \setminus \{\mathbf{x},\mathbf{y}\}} \Bigl |\varphi _{h,\psi }(\mathbf{x},\mathbf{z}) - \varphi _{h,\psi }(\mathbf{y},\mathbf{z})\Bigr |, \end{aligned}$$
(2)

where \(\varphi _{h,\psi }\) is as defined in Sect. 1. Since this dissimilarity index is based on the Mean of Absolute Differences of pairwise Distances, we call it MADD (see also Sarkar and Ghosh 2019). Using \(h(t)=\sqrt{t}\) and \(\psi (t) = t^2\), we get MADD based on the Euclidean distance, which is defined as

$$\begin{aligned} \rho _0(\mathbf{x},\mathbf{y}) = \frac{1}{N-2} \sum _{\mathbf{z}\in \mathcal{Z}_N {\setminus } \{\mathbf{x},\mathbf{y}\}} d^{-1/2}\big |\Vert \mathbf{x}-\mathbf{z}\Vert -\Vert \mathbf{y}-\mathbf{z}\Vert \big |. \end{aligned}$$
(3)
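For concreteness, a direct Python sketch of Eq. (2) is given below (ours, illustrative; phi can be any base distance, e.g. the phi_lin, phi_exp or phi_euc sketches from Sect. 1). The resulting matrix of MADD values can then replace the Euclidean distance matrix in the graph-based statistics of Sect. 1.

```python
import numpy as np

def madd_matrix(Z, phi):
    """All pairwise MADDs of Eq. (2) for the pooled sample Z,
    given a base distance phi(u, v)."""
    N = len(Z)
    D = np.array([[phi(Z[i], Z[j]) for j in range(N)] for i in range(N)])
    R = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            mask = np.ones(N, dtype=bool)
            mask[[i, j]] = False          # the sum excludes z = x and z = y
            R[i, j] = R[j, i] = np.mean(np.abs(D[i, mask] - D[j, mask]))
    return R

# rho_0 of Eq. (3): MADD built on the scaled Euclidean distance
rho0_matrix = lambda Z: madd_matrix(
    Z, lambda u, v: np.linalg.norm(u - v) / np.sqrt(len(u)))
```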

MADD has several desirable properties. One such property is mentioned below.

Lemma 3

For \(N\ge 3\), the dissimilarity index \(\rho _{h,\psi }\) is a semi-metric on \(\mathcal{Z}_N\).

The index \(\rho _{h,\psi }\) is not a metric since \(\rho _{h,\psi }(\mathbf{x},\mathbf{y})=0\) does not necessarily imply \(\mathbf{x}=\mathbf{y}\). However, if F and G are absolutely continuous distributions, then for any \(\mathbf{x}\ne \mathbf{y}\), \(\rho _{h,\psi }(\mathbf{x},\mathbf{y})\) is strictly positive with probability 1. So, in practice, \(\rho _{h,\psi }\) behaves like a metric. When \(\varphi _{h,\psi }\) is a metric, using the triangle inequality, we also get \(\rho _{h,\psi }(\mathbf{x},\mathbf{y})\le \varphi _{h,\psi }(\mathbf{x},\mathbf{y})\). So, closeness in terms of \(\varphi _{h,\psi }\) indicates closeness in terms of \(\rho _{h,\psi }\), but not the other way around. Note that if h is uniformly continuous, under Assumption 4, we have the probability convergence of \(\big |\varphi _{h,\psi }(\mathbf{X}_1,\mathbf{X}_2)-\varphi _{h,\psi }^*(F,F)\big |\), \(\big |\varphi _{h,\psi }(\mathbf{Y}_1,\mathbf{Y}_2)-\varphi _{h,\psi }^*(G,G)\big |\) and \(\big |\varphi _{h,\psi }(\mathbf{X}_1,\mathbf{Y}_1)-\varphi _{h,\psi }^*(F,G)\big |\) to 0 as d tends to infinity. This leads to the probability convergence of \(\rho _{h,\psi }(\mathbf{X}_1,\mathbf{X}_2)\), \(\rho _{h,\psi }(\mathbf{Y}_1,\mathbf{Y}_2)\) and \(\{\rho _{h,\psi }(\mathbf{X}_1,\mathbf{Y}_1)-\rho _{h,\psi }^{*}(F,G)\}\) to 0, where

$$\begin{aligned}&\rho _{h,\psi }^{*}(F,G)=\frac{1}{N-2}\\&\quad \Big \{(m-1)\bigl |\varphi _{ h,\psi }^*(F,G)-\varphi _{h,\psi }^*(F,F)\bigr | + (n-1)\bigl |\varphi _{h,\psi }^*(F,G)-\varphi _{h,\psi }^*(G,G)\bigr |\Big \}. \end{aligned}$$

So, in the case of high-dimensional data, unlike \(\varphi _{h, \psi }\), the dissimilarity index \(\rho _{h,\psi }\) usually takes small values for observations from the same distribution. The quantity \(\rho _{h,\psi }^{*}(F,G)\) is non-negative. But, in order to preserve the neighborhood structure (in terms of \(\rho _{h,\psi }\)) in high dimensions, we need to choose the functions h and \(\psi \) such that \(\rho _{h,\psi }^{*}(F,G)\) is strictly positive. The following lemma provides some guidance in this regard.

Lemma 4

Let \(h,\psi :[0,\infty ) \rightarrow [0,\infty )\) be strictly increasing functions such that \(h(0)=\psi (0)=0\) and \(\psi ^\prime (t)/t\) is a non-constant, completely monotone function. Then, for every \(d \ge 1\), \(\rho _{h,\psi }^*(F,G)\) is positive unless F and G have the same univariate marginal distributions.

Figures 5 and 6 show the performance of NN and MST-run tests in Examples 1 and 2, when they were constructed based on \(\rho _{h,\psi }\) for three different choices of h and \(\psi \): (i) \(h_0(t)=\sqrt{t}, \psi _0(t)=t^2\) (i.e., the Euclidean distance), (ii) \(h_1(t)=t, \psi _1(t)=t\) and (iii) \(h_2(t)=t,\psi _2(t)=1-\exp (-t)\). We denote the corresponding dissimilarity indices by \(\rho _0\), \(\rho _1\) and \(\rho _2\), respectively. Though NN and MST-run tests based on \(\rho _0\) did not perform well (see the curves corresponding to \({\widetilde{T}}^{\rho _0}_{NN}\) and \({{\widetilde{T}}}^{\rho _0}_{MST}\)), those based on \(\rho _1\) and \(\rho _2\) had excellent performance (see the curves corresponding to \({{\widetilde{T}}}^{\rho _1}_{NN}\), \({\widetilde{T}}^{\rho _2}_{NN}\) and \({{\widetilde{T}}}^{\rho _1}_{MST}\), \({\widetilde{T}}^{\rho _2}_{MST}\), respectively). Note that \(\psi _1\) and \(\psi _2\) satisfy the conditions stated in Lemma 4. So, for these two choices, we have \(\rho _{h,\psi }^*(F,G)>0\), and hence the neighborhood structure is preserved. This was the reason behind the excellent performance of these tests. The function \(\psi _0\) does not satisfy the conditions of Lemma 4. For \(\rho _0\), the neighborhood structure is preserved when \(\nu ^2>0\) or \(\sigma _F^2 \ne \sigma _G^2\). But, that was not the case in Examples 1 and 2.

Fig. 5

Powers of different tests in Example 1

Figures 5 and 6 also show the results for NBP and SHP-run tests based on \(\rho _{h,\psi }\). Again, the tests based on \(\rho _0\) had poor performance, but those based on \(\rho _1\) and \(\rho _2\) performed well. In Example 2, NBP and SHP-run tests based on \(\rho _{h,\psi }\) outperformed those based on \(\varphi _{h,\psi }\), but in Example 1, the tests based on \(\varphi _{h,\psi }\) had an edge. Recall that for the high-dimensional consistency of NBP and SHP-run tests based on \(\varphi _{h,\psi }\), we need \({{\widetilde{e}}}_{h,\psi }(F,G) = \liminf _{d \rightarrow \infty } e_{h,\psi }(F,G) > 0\) (see Theorem 1). Note that \(e_{h,\psi }(F,G)\) can be expressed as \(e_{h,\psi }(F,G)=\{\varphi _{h,\psi }^{*}(F,G) -\varphi _{h,\psi }^{*}(F,F)\}+\{\varphi _{h,\psi }^{*}(F,G) -\varphi _{h,\psi }^{*}(G,G)\}\). Our empirical experience suggests that when both of these terms are positive, tests based on \(\varphi _{h,\psi }\) usually have higher powers than those based on \(\rho _{h,\psi }\). But if they are of opposite signs, the tests based on \(\rho _{h,\psi }\) usually perform better. In Examples 1 and 2, we observed this not only for NBP and SHP-run tests, but for NN and MST-run tests as well.

Fig. 6

Powers of different tests in Example 2

Recently, Chen and Friedman (2017) developed a general framework to construct graph-based two-sample tests for multivariate data, where one counts the numbers of \(\mathbf{X}\mathbf{X}\)-type and \(\mathbf{Y}\mathbf{Y}\)-type edges (\(S_{xx}\) and \(S_{yy}\), say) in a sub-graph of \(\mathcal{G}\) and computes their deviations from their expected values under \(\mathcal{H}_0\). The test rejects \(\mathcal{H}_0\) for large values of the statistic \(T_{CF}={(\mathbf{S}- \text {E}_{\mathcal{H}_0}(\mathbf{S}))}^{\top }\big [\text {var}_{\mathcal{H}_0}(\mathbf{S})\big ]^{-1}(\mathbf{S}- \text {E}_{\mathcal{H}_0}(\mathbf{S}))\), where \(\mathbf{S}=(S_{xx},~ S_{yy})^{\top }\). Chen and Friedman (2017) used the k-nearest-neighbor graph (k-NN graph) and the MST of \(\mathcal{G}\) for all numerical work. The k-NN graph is an undirected sub-graph of \(\mathcal{G}\), which contains the edge \((\mathbf{u},\mathbf{v})\) if either \(\mathbf{v}\) is among the k nearest-neighbors of \(\mathbf{u}\) or \(\mathbf{u}\) is among the k nearest-neighbors of \(\mathbf{v}\). We computed this statistic for the shortest Hamiltonian path and the optimal non-bipartite sub-graph of \(\mathcal{G}\) as well. Powers of these tests (henceforth referred to as CF-NN, CF-MST, CF-SHP and CF-NBP tests, respectively) are also reported in Figs. 5 and 6. Just like the tests based on \(\rho _0\), these CF tests based on the Euclidean distance also had poor performance in these examples.
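Chen and Friedman (2017) derive closed-form expressions for \(\text {E}_{\mathcal{H}_0}(\mathbf{S})\) and \(\text {var}_{\mathcal{H}_0}(\mathbf{S})\) under the permutation null distribution. The sketch below (ours) instead estimates these moments by random permutations, which only approximates their statistic but conveys its structure:

```python
import numpy as np

rng = np.random.default_rng(2)

def edge_counts(edges, labels):
    """S = (S_xx, S_yy): numbers of within-X and within-Y edges."""
    sxx = sum(1 for i, j in edges if labels[i] == 1 and labels[j] == 1)
    syy = sum(1 for i, j in edges if labels[i] == 0 and labels[j] == 0)
    return np.array([sxx, syy], dtype=float)

def t_cf(edges, labels, B=2000):
    """Approximate T_CF with permutation estimates of the null moments."""
    S = edge_counts(edges, labels)
    null = np.array([edge_counts(edges, rng.permutation(labels))
                     for _ in range(B)])
    diff = S - null.mean(axis=0)
    return float(diff @ np.linalg.inv(np.cov(null.T)) @ diff)  # Mahalanobis form
```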

Now, to carry out a theoretical study on the high-dimensional behavior of the tests based on \(\rho _{h,\psi }\), in view of Lemma 4, we make the following assumption.

Assumption 5

\({\widetilde{\rho }}_{h,\psi }(F,G) = \liminf _{d \rightarrow \infty } \rho _{h,\psi }^{*}(F,G) > 0\).

From the proof of Lemma 4, one can see that for any fixed d, \(\rho _{h,\psi }^*(F,G)= 0\) if and only if \(e_{F,G}^{(q)}=2 \text {E}\psi (|X_1^{(q)}-Y_1^{(q)}|) - \text {E}\psi (|X_1^{(q)}-X_2^{(q)}|)- \text {E}\psi (|Y_1^{(q)}-Y_2^{(q)}|)=0\) for \(q=1,\ldots ,d\). This quantity \(e_{F,G}^{(q)}\) is an energy distance between the q-th univariate marginals of F and G (see, e.g., Székely and Rizzo 2013) that contains the signal against \(\mathcal{H}_0\). Now, \({\widetilde{\rho }}_{h,\psi }(F,G)\) becomes 0 only when \(\liminf _{d \rightarrow \infty } d^{-1}\sum _{q=1}^{d} e_{F,G}^{(q)}=0\). So, Assumption 5 asserts that the average signal is asymptotically non-negligible. In the classical asymptotic regime, we consider d to be fixed and expect to get more information as m and n increase. But in the HDLSS asymptotic regime, where we consider m and n to be fixed, we expect to get more information as d increases. This is ensured by Assumptions 4 and 5. The following theorem shows the high-dimensional consistency of modified NN and MST-run tests based on \(\rho _{h, \psi }\) under these assumptions.

Theorem 4

Suppose that \(\mathbf{X}_1,\ldots ,\mathbf{X}_m \sim F\) and \(\mathbf{Y}_1,\ldots ,\mathbf{Y}_n \sim G\) are independent random vectors, and \(\rho _{h,\psi }\) is used to construct the test statistics, where h and \(\psi \) satisfy the conditions of Lemma 4. Then, under Assumptions 4 and 5, we get the following results.

  1. (a)

    Let \(N_0\) and \(m_0\) be defined as in Theorem 2. If \(k < \min \{m,n\}\) and \(\left( {\begin{array}{c}N_0\\ m_0\end{array}}\right) <\alpha \left( {\begin{array}{c}N\\ m\end{array}}\right) \), then the power of the NN test (of level \(\alpha \)) based on \(\rho _{h,\psi }\) converges to 1 as d tends to infinity.

  2. (b)

    If \(\max \{\lfloor N/m \rfloor , \lfloor N/n \rfloor \} < \alpha \left( {\begin{array}{c}N\\ m\end{array}}\right) \), then the power of the MST-run test (of level \(\alpha \)) based on \(\rho _{h,\psi }\) converges to 1 as d tends to infinity.

This theorem shows that if the sample sizes are not too small (i.e., m and n satisfy the conditions stated in Theorem 2), NN and MST-run tests based on \(\rho _{h,\psi }\) are consistent in high dimensions. Under Assumptions 4 and 5, SHP-run and NBP tests based on \(\rho _{h,\psi }\) also enjoy this high-dimensional consistency when m and n satisfy the conditions stated in Theorem 1. We state this result in the following theorem. The proof is similar to that of Theorem 1 and is therefore omitted.

Theorem 5

Suppose that \(\mathbf{X}_1,\ldots ,\mathbf{X}_m \sim F\) and \(\mathbf{Y}_1,\ldots ,\mathbf{Y}_n \sim G\) are independent random vectors, and \(\rho _{h,\psi }\) is used to construct the test statistics, where h and \(\psi \) satisfy the conditions of Lemma 4. Then, under Assumptions 4 and 5, we get the following results.

  1. (a)

    If \(N/\left( {\begin{array}{c}N\\ m\end{array}}\right) < \alpha \), then the power of the SHP-run test (of level \(\alpha \)) based on \(\rho _{h,\psi }\) converges to 1 as d tends to infinity.

  2. (b)

    Let \(c(m,n)\) be as defined in Theorem 1. If \(c(m,n) < \alpha \), then the power of the NBP test (of level \(\alpha \)) based on \(\rho _{h,\psi }\) converges to 1 as d tends to infinity.

3.1 Performance under weak signal

In Theorems 4 and 5, we have established the consistency of the tests based on \(\rho _{h,\psi }\) when \( {\widetilde{\rho }}_{h,\psi }(F,G) > 0\), or equivalently, \(\liminf _{d \rightarrow \infty } d^{-1} \sum _{q=1}^d e_{F,G}^{(q)} >0\). So, we need \(\sum _{q=1}^d e_{F,G}^{(q)}\), the total signal against \(\mathcal{H}_0\), to increase at least at the rate of d. But, if only a few of the measurement variables carry information against \(\mathcal{H}_0\), we may have \(\liminf _{d \rightarrow \infty } d^{-1} \sum _{q=1}^d e_{F,G}^{(q)} =0\). Next, we investigate the high-dimensional behavior of the tests based on \(\rho _{h,\psi }\) in such situations. For two independent random vectors \(\mathbf{U},\mathbf{V}\) from F or G, let us assume that \(\text {var}\{\sum _{q=1}^d \psi (|U^{(q)}-V^{(q)}|)\} = \mathbf{O}(\vartheta ^2(d))\). If the measurement variables are highly correlated, we usually have \(\vartheta ^2(d)=\mathbf{O}(d^2)\). But weak dependence among the measurement variables leads to \(\vartheta ^2(d)=\mathbf{o}(d^2)\). For instance, when they are m-dependent, one gets \(\vartheta ^2(d) = d\). Now, for our investigation, we make the following assumption, which is weaker than Assumption 5.

Assumption 6

As d tends to infinity, \(\rho ^*_{h,\psi }(F,G) ~d/\vartheta (d)\) diverges to infinity.

In Assumption 6, we allow \(\rho ^*_{h,\psi }(F,G)\) to converge to 0, but at a rate slower than that of \(\vartheta (d)/d\). For instance, for an m-dependent sequence, we allow \(\rho ^*_{h,\psi }(F,G)\) to converge to 0 at a rate slower than \(d^{-1/2}\). Even when the measurement variables are not m-dependent, under certain weak dependence assumptions on the underlying marginal distributions, we have \(\vartheta ^2(d) = d L(d)\), where L is a slowly varying function (see, e.g., Lin and Lu 1996, Chap. 2). In that case, we allow \(\rho ^*_{h,\psi }(F,G)\) to converge to 0 at a rate slower than \(d^{-1/2} L^{1/2}(d)\). Under Assumption 6, \(\rho _{h,\psi }\) preserves the neighborhood structure in high dimensions when h is Lipschitz continuous, and this ensures the high-dimensional consistency of the resulting tests. The result is stated below.

Theorem 6

Suppose that \(\mathbf{X}_1,\ldots ,\mathbf{X}_m \sim F\) and \(\mathbf{Y}_1,\ldots ,\mathbf{Y}_n \sim G\) are independent random vectors, where F and G satisfy Assumptions 4 and 6. If h is Lipschitz continuous, then

$$\begin{aligned} \text {Pr}\Big [\min _{i,j} \rho _{h,\psi }(\mathbf{X}_i,\mathbf{Y}_j) > \max \big \{\max _{i \ne j} \rho _{h,\psi }(\mathbf{X}_i,\mathbf{X}_j) , \max _{i \ne j} \rho _{h,\psi }(\mathbf{Y}_i,\mathbf{Y}_j)\big \}\Big ] \rightarrow 1 \text { as } d \rightarrow \infty . \end{aligned}$$

Consequently, if m and n satisfy the conditions of Theorem 4, then the powers of NN and MST-run tests (of level \(\alpha \)) based on \(\rho _{h,\psi }\) converge to 1 as d tends to infinity. Similarly, if m and n satisfy the conditions of Theorem 5, then the powers of NBP and SHP-run tests (of level \(\alpha \)) based on \(\rho _{h,\psi }\) also converge to 1 as d tends to infinity.

For independent random vectors \(\mathbf{X}\sim F\) and \(\mathbf{Y}\sim G\), under the assumptions of Theorem 6, we get \(\rho _{h,\psi }(\mathbf{X},\mathbf{Y}) = \rho ^*_{h,\psi }(F,G) + \mathbf{O}_{P}(\vartheta (d)/d)\) (see the proof of Theorem 6). While \(\rho ^*_{h,\psi }(F,G)\) is the signal against \(\mathcal{H}_0\), the quantity \(\vartheta (d)/d=\sqrt{\text {var}\big \{d^{-1}\sum _{q=1}^d \psi (|X^{(q)}-Y^{(q)}|)\big \}}\) represents the stochastic variation or noise. Theorem 6 shows the high-dimensional consistency of the tests based on \(\rho _{h,\psi }\) when h is Lipschitz continuous and the signal-to-noise ratio diverges with d. Similar results can be obtained even when h is not Lipschitz. For instance, in the case of \(\rho _0\), \(h(t) = \sqrt{t}\) is not Lipschitz continuous, but we have the following result.

Theorem 7

Suppose that \(\mathbf{X}_1,\ldots ,\mathbf{X}_m \sim F\) and \(\mathbf{Y}_1,\ldots ,\mathbf{Y}_n \sim G\) are independent random vectors, where F and G have mean vectors \({\varvec{\mu }}_F, {\varvec{\mu }}_G\) and dispersion matrices \({\varvec{\Sigma }}_F, {\varvec{\Sigma }}_G\), respectively. Further assume that \(\liminf _{d \rightarrow \infty } \min \{trace({\varvec{\Sigma }}_F),trace({\varvec{\Sigma }}_G)\}/\vartheta (d) > 0\), where \(\vartheta ^2(d)\) is the order of \(\text {var}(\Vert \mathbf{W}\Vert ^2)\) for \(\mathbf{W}=\mathbf{X}_1-\mathbf{X}_2\), \(\mathbf{Y}_1-\mathbf{Y}_2\) and \(\mathbf{X}_1-\mathbf{Y}_1\) (i.e., \(\text {var}(\Vert \mathbf{W}\Vert ^2) = \mathbf{O}(\vartheta ^2(d))\)). If \(\Vert {\varvec{\mu }}_F-{\varvec{\mu }}_G\Vert ^2/\vartheta (d)\) or \(|trace({\varvec{\Sigma }}_F) - trace({\varvec{\Sigma }}_G)|/\vartheta (d)\) diverges to infinity as d increases, then

$$\begin{aligned} {\text {Pr}}\Big [\min _{i,j} \rho _{0}(\mathbf{X}_i,\mathbf{Y}_j) > \max \big \{\max _{i \ne j} \rho _{0}(\mathbf{X}_i,\mathbf{X}_j) , \max _{i \ne j} \rho _{0}(\mathbf{Y}_i,\mathbf{Y}_j)\big \}\Big ] \rightarrow 1 \text { as } d \rightarrow \infty . \end{aligned}$$

Consequently, if m and n satisfy the conditions of Theorem 4, then the powers of NN and MST-run tests (of level \(\alpha \)) based on \(\rho _0\) converge to 1 as d tends to infinity. Similarly, if m and n satisfy the conditions of Theorem 5, then the powers of NBP and SHP-run tests (of level \(\alpha \)) based on \(\rho _0\) also converge to 1 as d tends to infinity.

Thus, when the measurement variables are m-dependent, for the consistency of the tests based on \(\rho _0\), we need either \(d^{-1/2} \Vert {\varvec{\mu }}_F - {\varvec{\mu }}_G\Vert ^2\) or \(d^{-1/2} |trace({\varvec{\Sigma }}_F)-trace({\varvec{\Sigma }}_G)|\) to diverge to infinity as d increases. This condition is much weaker than the condition ‘\(d^{-1} \Vert {\varvec{\mu }}_F - {\varvec{\mu }}_G\Vert ^2>0\) or \(d^{-1} |trace({\varvec{\Sigma }}_F)-trace({\varvec{\Sigma }}_G)|>0\)’ (i.e., \(\nu ^2>0\) or \(\sigma _F^2\ne \sigma _G^2\)) assumed for the tests based on the Euclidean distance.

3.2 Computational issues

Computation of MADD between two data points has an associated cost of the order \(\mathbf{O}(dN)\) compared to \(\mathbf{O}(d)\) needed for the Euclidean distance or \(\varphi _{h,\psi }\). But in the HDLSS setup, where d is much larger than N, these are of the same asymptotic order. Moreover, after computing all pairwise distances, the steps used for obtaining the test statistics are the same in all cases. Therefore, for HDLSS data, though the tests based on MADD require more time than the corresponding tests based on the Euclidean distance or \(\varphi _{h,\psi }\), the time difference is very small. This is quite evident from Table 1, which shows average computing times required by different tests based on the Euclidean distance and \(\rho _0\) for various dimensions and sample sizes. We used MATLAB codes for the NN test, the MST-run test and the SHP-run test, while the codes for the NBP test were written in R. All the codes were run on a computer with 16 GB RAM, having an Intel Core i5 CPU with a clock speed of 1.60 GHz.

Table 1 Average run times (of 100 trials) for different tests (in seconds)
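The computational point above can be made concrete: once the \(N\times N\) matrix D of base distances is available, all MADD values follow from D alone, without revisiting the d coordinates. A vectorized Python sketch (ours, illustrative; assumes a symmetric D with zero diagonal):

```python
import numpy as np

def madd_from_D(D):
    """All pairwise MADDs from a precomputed symmetric N x N matrix D of
    base distances with zero diagonal. Computing D costs O(N^2 d); this
    step is O(N^3) and never touches the d coordinates again."""
    N = D.shape[0]
    S = np.abs(D[:, None, :] - D[None, :, :]).sum(axis=2)
    R = (S - 2 * D) / (N - 2)     # drop the z = x and z = y terms of the sum
    np.fill_diagonal(R, 0.0)
    return R
```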

4 Results from the analysis of simulated and real data sets

In this section, we analyze a few more simulated data sets and two real data sets for further evaluation of the tests based on \(\varphi _{h,\psi }\) and \(\rho _{h,\psi }\). For each of these examples, we repeated the experiment 500 times to compute the powers of different tests, which are shown in Figs. 7, 8, 9, 10, 11, 12 and 13. We also report the results for the usual tests based on the Euclidean distance and those proposed by Chen and Friedman (2017) to facilitate comparison. For all simulated data sets, we used \(m=n=20\) as before.

4.1 Analysis of simulated data sets

We start with two examples (Examples 3 and 4), where both populations have i.i.d. measurement variables, and they differ only in their locations and scales. In Example 3 (respectively, Example 4), these measurement variables have normal (respectively, Cauchy) distributions. While the location parameter is 0 and the scale parameter is 1 for the first population, they are 0.1 and 1.1, respectively, for the second population. Powers of different tests in these two examples are shown in Fig. 7. One can see that in the presence of the normal distribution (which is light-tailed) in Example 3, the tests based on \(\psi _1(t)=t\) performed better than the ones based on \(\psi _2(t)=1-\exp (-t)\). But we observed a completely different picture in Example 4. In the presence of the Cauchy distribution (which is heavy-tailed), all tests based on \(\psi _1\) had powers close to the nominal level, but those based on \(\psi _2\) had excellent performance. This shows the robustness of our proposed tests based on bounded \(\psi \) functions against heavy-tailed distributions. CF tests based on the Euclidean distance performed well in Example 3, but, like the other tests based on the Euclidean distance, they failed in Example 4 due to their lack of robustness.

Fig. 7

Powers of different tests in Examples 3 and 4

Figure 7 also shows that in both examples, modified NN and MST-run tests based on \(\varphi _{h,\psi }\) performed very poorly. Note that in both of these examples, the scale-difference between the two distributions dominates the location-difference, and for all three choices of h and \(\psi \), we have \(\varphi _{h,\psi }^{*}(F,F)< \varphi _{h,\psi }^{*}(F,G) < \varphi _{h,\psi }^{*}(G,G)\). This violation of the neighborhood structure was the reason behind the failure of these tests. As expected from our earlier discussion, in these examples, modified versions of SHP-run and NBP tests based on \(\rho _{h,\psi }\) performed much better than those based on \(\varphi _{h,\psi }\).

Our next two examples involve alternatives with sparse signals, where only a fraction of the measurement variables contain information against \(\mathcal{H}_0\), and that fraction shrinks to 0 as the dimension increases. So, Assumption 5 does not hold in these examples.

Example 5

We consider two normal distributions \(\mathcal{N}_d(\mathbf{0}_d,\mathbf{I}_d)\) and \(\mathcal{N}_d({\varvec{\mu }}_d,{\varvec{\Lambda }}_d)\) as F and G, respectively, where \({\varvec{\mu }}_d = (\mu _1,\ldots ,\mu _d)^\top \) with \(\mu _i = \sqrt{0.01\log (d)}\) for \(i=1,\ldots ,\lfloor d^{1/2}\rfloor \) and 0 otherwise. The diagonal matrix \({\varvec{\Lambda }}_d\) has the first \(\lfloor d^{1/2}\rfloor \) elements equal to \(0.5\log (d)\) and the rest equal to 1.

In this example, modified tests based on \(\rho _0\) outperformed all other tests considered here (see Fig. 8). This is quite expected because here the two distributions have light tails, and they differ only in their locations and scales. Note that for \(\rho _0\), \(\sum _{q=1}^{d} e_{F,G}^{(q)}\) is of the order \(\mathbf{O}(d^{1/2} \log (d))\), while for \(\rho _2\), it is of the order \(\mathbf{O}(d^{1/2})\) since the function \(\psi _2\) is bounded. In this example, \(\nu ^2\) has a smaller order than \(|\sigma _F^2-\sigma _G^2|\) (here \(\nu ^2\), \(\sigma _F^2\) and \(\sigma _G^2\) have the same meaning as in Assumption 3). So, NN and MST-run tests based on the Euclidean distance had powers close to zero. But the SHP-run test based on the Euclidean distance performed much better. This is consistent with our theoretical results.

Fig. 8

Powers of different tests in Example 5

Example 6

Both F and G have independent measurement variables. For F, they are i.i.d. \(\mathcal{N}(0,1)\). For G, the first \(\lfloor d^{2/3} \rfloor \) variables are \(t_3(0,1/3)\), and the rest are \(\mathcal{N}(0,1)\). So, F and G have the same location and dispersion structure, while the first \(\lfloor d^{2/3} \rfloor \) marginal distributions differ in their shapes.

In this example, we observed a picture (see Fig. 9) quite different from that in Example 5. Modified tests based on \(\rho _1\) and \(\rho _2\), particularly those based on \(\rho _2\), performed much better than all other tests considered here. For the SHP-run test, modified versions based on \(\varphi _{h,\psi }\) with \(\psi _1\) and \(\psi _2\) had powers comparable to those based on \(\rho _1\) and \(\rho _2\), respectively. Note that in this example, the two distributions have the same location and scale, but they differ in their univariate marginal distributions. In such a case, pairwise Euclidean distances fail to extract the signal against \(\mathcal{H}_0\). So, CF tests, the tests based on the Euclidean distance and those based on \(\rho _0\) all had powers close to the nominal level.

Fig. 9

Powers of different tests in Example 6

In our last two examples, we consider some situations where the assumptions used for our theoretical investigations do not hold. These examples are used to investigate how our proposed tests perform when the assumptions are violated.

Example 7

We take F to be an equal mixture of \(\mathcal{N}_d(0.3\mathbf{1}_d,\mathbf{I}_d)\) and \(\mathcal{N}_d(-0.3\mathbf{1}_d,4\mathbf{I}_d)\), while G is taken to be an equal mixture of \(\mathcal{N}_d(0.3{\varvec{\alpha }}_d,\mathbf{I}_d)\) and \(\mathcal{N}_d(-0.3{\varvec{\alpha }}_d,4\mathbf{I}_d)\), where \({\varvec{\alpha }}_d = (1,-1,\ldots ,(-1)^{d+1})^\top \).

In this example, NN and MST-run tests based on the Euclidean distance performed poorly (see Fig. 10), and the same was observed for CF-NN and CF-MST tests as well. However, SHP-run and NBP tests based on the Euclidean distance and the corresponding CF tests (CF-SHP and CF-NBP) performed well. Tests based on \(\rho _{h,\psi }\), especially the ones based on \(\rho _0\), had much superior performance in this example. Note that in this example, Assumptions 1–4 do not hold for the mixture distributions, but they hold for each component distribution. If we consider them as separate distributions, using the distance concentration phenomenon, we can explain the reasons behind the poor performance of the NN and MST-run tests based on the Euclidean distance and the superiority of their modifications based on MADD. Since the component distributions are light-tailed, the performance of the tests based on \(\rho _0\) and \(\rho _1\) was much better than that of the tests based on \(\rho _2\). The same was observed for the corresponding tests based on \(\varphi _{h,\psi }\) as well.

Fig. 10

Powers of different tests in Example 7

Example 8

We take F to be the \(\mathcal{N}_d(\mathbf{0}_d, 3\mathbf{I}_d)\) distribution, while G is taken to be the standard multivariate t distribution with 3 degrees of freedom.

Here the two populations have the same mean vector and the same dispersion matrix, while they differ only in their shapes. The results reported in Fig. 11 show that the modified tests based on MADD performed much better in this example. NN and MST-run tests based on the Euclidean distance and \(\varphi _{h,\psi }\) failed, but the corresponding SHP-run and NBP tests performed well. CF-SHP and CF-NBP tests also performed well, but CF-NN and CF-MST tests showed some strange behavior. The powers of these two tests initially increased with the dimension, but then they dropped down to zero as the dimension increased further.

Fig. 11

Powers of different tests in Example 8

4.2 Analysis of benchmark data sets

We analyzed two benchmark data sets, taken from the UCR Time Series Classification Archive (http://www.cs.ucr.edu/~eamonn/time_series_data/). These data sets, namely the Gun-Point data and the Lightning-2 data, have been extensively used in the literature on supervised classification. It is well-known that there is a reasonable separation between the two distributions in both of these data sets. So, assuming \(\mathcal{H}_0\) to be false, we compared different tests based on their powers. These data sets consist of separate training and test samples. For our analysis, we merged these sets and, following Biswas et al. (2014), we used random sub-samples of different sizes from the whole data set, keeping the proportions of observations from different distributions as close as possible to those in the original data set. Each experiment was repeated 500 times to compute the powers of different tests, and they are shown in Figs. 12 and 13.

Fig. 12

Powers of different tests in Gun-Point data

The Gun-Point data set comes from the video surveillance domain. It contains 100 observations from each of two classes: Gun-Draw and Point. For Gun-Draw, an actor draws a gun from a hip-mounted holster, points it at a target for approximately one second, and then returns the gun to the holster. For Point, the actor makes the same move, but instead of the gun, points the index finger at the target for approximately one second, and then returns to the initial position. For each class, an observation consists of 150 measurements corresponding to the X co-ordinate of the centroid of the actor’s right hand during one movement. In this data set, modified tests based on \(\varphi _{h,\psi }\) performed better than the tests based on the Euclidean distance and MADD as well as the CF tests (see Fig. 12). Among these modified tests, the one based on \(\psi _2\) had a slight edge. The performance of the tests based on the Euclidean distance and MADD was quite similar, but the CF tests had inferior performance in all cases.

The Lightning-2 data set consists of observations from two classes: Cloud-to-Ground lightning and Intra-Cloud lightning. Each observation corresponds to transient electromagnetic events detected by the FORTE satellite. Every input went through a Fourier transform to get a spectrogram, which was then collapsed in frequency to produce a power density time series. These time series were smoothed to produce 637-dimensional observations. The data set consists of 48 and 73 observations from the two classes. Figure 13 shows that the tests based on \(\varphi _{h,\psi }\) and \(\rho _{h,\psi }\) had superior performance in this example when \(\psi _1\) or \(\psi _2\) was used. Among these tests, the ones based on \(\psi _{2}(t)=1-\exp (-t)\) performed better. Tests based on \(\rho _0\) and those based on the Euclidean distance had almost similar performance. CF tests did not have satisfactory performance in this example; their powers were much lower than those of all other tests considered here.

Fig. 13

Powers of different tests in Lightning-2 data

5 Concluding remarks

In this article, we have used a new class of distance functions, \(\varphi _{h,\psi }\), and associated MADD indices \(\rho _{h,\psi }\) to modify some popular graph-based two-sample tests constructed using pairwise distances. In high dimensions, while the usual tests based on pairwise Euclidean distances are mainly useful for populations differing in their locations and overall scales, our modified tests can perform well for a larger variety of alternatives. There are some relative merits and demerits of these two types of modifications. For SHP-run and NBP tests, the first type of modification based on \(\varphi _{h,\psi }\) is sufficient for high-dimensional consistency. But this is not the case for NN and MST-run tests, where the modification based on \(\varphi _{h,\psi }\) can produce poor results. In such cases, the use of MADD improves the performance of the tests significantly. When the neighborhood structure is retained by \(\varphi _{h,\psi }\), the modified tests based on it perform better than the corresponding versions based on MADD. But, when the neighborhood structure is violated, the use of MADD is preferred. For both \(\varphi _{h,\psi }\) and \(\rho _{h,\psi }\), the use of bounded \(\psi \) functions (e.g., \(\psi (t)=1-\exp (-t)\)) makes the resulting tests robust against outliers generated from heavy-tailed distributions. But, for light-tailed distributions, the use of linear or square functions (i.e., \(\psi (t)=t\) or \(\psi (t)=t^2)\) produces better results.

Our general recipe based on MADD can also be used to improve the high-dimensional performance of many other two-sample tests. For instance, we can modify the tests based on averages of pairwise distances (see, e.g., Baringhaus and Franz 2004, 2010; Biswas and Ghosh 2014; Tsukada 2019). CF tests (Chen and Friedman 2017) can also be modified using \(\varphi _{h,\psi }\) or \(\rho _{h, \psi }\). High-dimensional consistency of the resulting tests can be proved using arguments similar to those used in this article. These modified average-distance-based and graph-based tests have relative merits and demerits. While average-distance-based tests (see, e.g., Sarkar and Ghosh 2018) put more emphasis on the magnitude of the observations, graph-based tests rely mainly on different types of edge-counts. When the underlying distributions are light-tailed, average-distance-based tests usually perform better than graph-based tests. On the other hand, graph-based tests are more robust against contamination and outliers generated from heavy-tailed distributions. As a result, they often outperform average-distance-based tests when the underlying distributions have heavy tails. Our empirical experience also suggests that there are several other cases when graph-based tests are preferred. For instance, when one or more of the underlying distributions are mixtures of several distributions, average-distance-based tests may perform poorly while graph-based tests can produce excellent results. However, it needs an extensive investigation, which is beyond the scope of this article. Using our ideas based on \(\varphi _{h,\psi }\) or \(\rho _{h, \psi }\), several multi-sample tests can also be modified to achieve better performance in high dimensions. For the NN test and its modified versions, throughout this article, we have reported all numerical results for \(k=3\) only. However, our findings remained almost the same for other values of k as well. This is expected in view of the theoretical results stated in this article.

For the construction of the general version of MADD [see Eq. (2)], we have used transformations on each of the measurement variables. This type of construction helps us to detect differences between the univariate marginals. However, this method can fail in situations where the distributions have the same marginals, but differ in their joint structures. In such situations, following Sarkar and Ghosh (2018, Section 3), one can partition the measurement vector \(\mathbf{x}\) into K non-overlapping blocks \({{\tilde{\mathbf{x}}}}_1,\ldots ,{{\tilde{\mathbf{x}}}}_{K}\) of sizes \(d_1,\ldots ,d_K\) (\(\sum _{i=1}^{K} d_i=d\)), respectively, and define MADD using blocked distance functions of the form \(\varphi _{h,\psi }^{B}(\mathbf{x},\mathbf{y})\) = \(h\{K^{-1} \sum _{q=1}^{K} \psi (\Vert {{\tilde{\mathbf{x}}}}_{q}-{{\tilde{\mathbf{y}}}}_{q}\Vert )\}\). As long as the block sizes are uniformly bounded, and the joint distributions of the variables in the blocks differ under F and G, consistency of the resulting tests based on MADD can be proved under conditions similar to Assumptions 4–6. Similar results can be proved for the corresponding tests based on \(\varphi _{h,\psi }^{B}\) as well. This type of blocking can reveal more minute differences between two distributions. For instance, using blocks of size 2, one can distinguish between two distributions having the same univariate marginals but different correlation structures (see, e.g., Sarkar and Ghosh 2018). In that case, ideally, one would like to put highly correlated variables in the same block. In general, we would like to find blocks which are nearly independent, but whose variables have significant dependence among themselves. But, at this moment, it is not clear how to develop an algorithm for finding such optimal blocks from the data. This can be considered as an interesting problem for future research.
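A minimal Python sketch of this blocked construction (ours, illustrative; equal block sizes with d divisible by the block size are assumed for simplicity):

```python
import numpy as np

def phi_block(x, y, block=2, h=lambda t: t, psi=lambda t: 1 - np.exp(-t)):
    """Blocked distance phi^B_{h,psi}: psi is applied to the Euclidean norm
    of each block-wise difference (d divisible by `block` assumed)."""
    diff = (x - y).reshape(-1, block)    # K x block, with K = d / block
    return h(np.mean(psi(np.linalg.norm(diff, axis=1))))
```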

6 Codes

Codes for implementation of different tests proposed in the article can be found at https://drive.google.com/open?id=1Ym8r6reNDfZrQ79LP4MJ93ybNc8-A2b1.