On some graph-based two-sample tests for high dimension, low sample size data


Machine Learning (IF 4.3), Pub Date: 2019-11-13, DOI: 10.1007/s10994-019-05857-4
Soham Sarkar, Rahul Biswas, Anil K. Ghosh

Testing for the equality of two high-dimensional distributions is a challenging problem, and this becomes even more challenging when the sample size is small. Over the last few decades, several graph-based two-sample tests have been proposed in the literature, which can be used for data of arbitrary dimensions. Most of these test statistics are computed using pairwise Euclidean distances among the observations. But, due to concentration of pairwise Euclidean distances, these tests have poor performance in many high-dimensional problems. Some of them can have powers even below the nominal level when the scale-difference between two distributions dominates the location-difference. To overcome these limitations, we introduce a new class of dissimilarity indices and use it to modify some popular graph-based tests. These modified tests use the distance concentration phenomenon to their advantage, and as a result, they outperform the corresponding tests based on the Euclidean distance in a wide variety of examples. We establish the high-dimensional consistency of these modified tests under fairly general conditions. Analyzing several simulated as well as real data sets, we demonstrate their usefulness in high dimension, low sample size situations.
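
The distance concentration mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the authors' dissimilarity index (the abstract does not define it) and not any of the graph-based tests themselves. It only assumes two Gaussian samples that differ in scale, with hypothetical helper names `scaled_pairwise_distances` and `concentration_demo`, and shows that Euclidean distances divided by the square root of the dimension cluster tightly around three distinct values.

```python
import numpy as np


def scaled_pairwise_distances(a, b):
    """Euclidean distances between rows of a and rows of b, divided by sqrt(dimension)."""
    dim = a.shape[1]
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point cancellation.
    return np.sqrt(np.maximum(sq, 0.0) / dim)


def concentration_demo(dim, n=10, sigma=2.0, seed=0):
    """Illustrative setup (an assumption, not from the paper):
    X ~ N(0, I) and Y ~ N(0, sigma^2 I), n observations each."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, dim))
    y = sigma * rng.standard_normal((n, dim))

    off_diag = ~np.eye(n, dtype=bool)  # drop zero self-distances
    d_xx = scaled_pairwise_distances(x, x)
    d_yy = scaled_pairwise_distances(y, y)
    d_xy = scaled_pairwise_distances(x, y)

    # For each Y observation, compare its nearest X with its nearest other Y.
    nearest_x = d_xy.min(axis=0)
    nearest_y = np.where(off_diag, d_yy, np.inf).min(axis=1)
    frac_y_nearest_is_x = float(np.mean(nearest_x < nearest_y))

    return {
        "mean_xx": float(d_xx[off_diag].mean()),
        "mean_yy": float(d_yy[off_diag].mean()),
        "mean_xy": float(d_xy.mean()),
        "frac_y_nearest_is_x": frac_y_nearest_is_x,
    }


if __name__ == "__main__":
    for dim in (10, 100, 5000):
        print(dim, concentration_demo(dim))
```

In this setup the scaled within-X, within-Y, and between-sample distances approach sqrt(2), sigma * sqrt(2), and sqrt(1 + sigma^2) as the dimension grows. With sigma > 1 the between-sample distance is smaller than the within-Y distance, so every Y observation tends to have an X observation as its nearest neighbor. This is one way tests built directly on Euclidean neighbor relations can lose power when a scale difference dominates.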


Updated: 2019-11-13
