Abstract
Technological advances have made it possible to collect large numbers of complex data objects, and the homogeneity structure among such objects is widely used in statistics. However, existing measures of homogeneity rely on restrictive conditions, such as moment or parametric assumptions. To overcome this limitation, this paper introduces the characteristic distance, a novel metric that completely characterizes the homogeneity of two distributions. The proposed distance possesses several desirable statistical properties: (i) it is distribution-free (nonparametric) and is therefore robust to the data distribution; (ii) it is nonnegative and equals zero if and only if the two distributions are homogeneous; (iii) it has a clear and intuitive probabilistic interpretation, and its empirical version is easy to compute and reduces to a sum of two V-statistics. Theoretically, the asymptotic distributions of the empirical distance are thoroughly investigated: a mixture of \(\chi ^{2}\) distributions under the null hypothesis and asymptotic normality under the alternative hypothesis. Simulation studies and a real data application suggest that the empirical characteristic distance has favorable power in testing the homogeneity of distributions.
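The abstract does not state the formula of the characteristic distance, so the following is only a minimal sketch of the general workflow it describes (a nonnegative two-sample distance estimated by V-statistics and used as a test statistic), not the authors' method. As a stand-in it uses the well-known energy distance, \(2E\Vert X-Y\Vert - E\Vert X-X'\Vert - E\Vert Y-Y'\Vert\), and calibrates the test by permutation rather than by the asymptotic null distribution. The function names `pairwise_dist`, `energy_distance_v` and `permutation_pvalue` are hypothetical and chosen for illustration.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def energy_distance_v(x, y):
    """V-statistic estimate of the energy distance (stand-in statistic).

    Assumption: this is NOT the characteristic distance of the paper; it is a
    different, standard distance that is also nonnegative and vanishes only
    when the two distributions coincide.
    """
    dxy = pairwise_dist(x, y).mean()  # between-sample term
    dxx = pairwise_dist(x, x).mean()  # within-sample term for x (diagonal included)
    dyy = pairwise_dist(y, y).mean()  # within-sample term for y (diagonal included)
    return 2.0 * dxy - dxx - dyy


def permutation_pvalue(x, y, n_perm=200, seed=0):
    """Permutation test of homogeneity based on the stand-in statistic."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    n = len(x)
    observed = energy_distance_v(x, y)
    count = 0
    for _ in range(n_perm):
        # Reassign group labels at random under the null of equal distributions.
        idx = rng.permutation(len(pooled))
        stat = energy_distance_v(pooled[idx[:n]], pooled[idx[n:]])
        if stat >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(40, 2))
    y = rng.normal(loc=2.0, size=(40, 2))  # location-shifted sample
    stat, pval = permutation_pvalue(x, y)
    print(f"statistic = {stat:.4f}, permutation p-value = {pval:.4f}")
```

A permutation calibration is used here because it needs no knowledge of the limiting null distribution; a test based on the \(\chi ^{2}\)-mixture limit mentioned in the abstract would require the paper's own derivations.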
Acknowledgements
This work is partially supported by the National Natural Science Foundation of China (Grant No. 12071267).
About this article
Cite this article
Li, X., Hu, W. & Zhang, B. Measuring and testing homogeneity of distributions by characteristic distance. Stat Papers 64, 529–556 (2023). https://doi.org/10.1007/s00362-022-01327-7