Fast and powerful conditional randomization testing via distillation
Biometrika ( IF 2.7 ) Pub Date : 2021-07-02 , DOI: 10.1093/biomet/asab039
Molei Liu 1 , Eugene Katsevich 2 , Lucas Janson 3 , Aaditya Ramdas 4
Summary We consider the problem of conditional independence testing: given a response $Y$ and covariates $(X,Z)$, we test the null hypothesis that $Y {\perp\!\!\!\perp} X \mid Z$. The conditional randomization test was recently proposed as a way to use distributional information about $X\mid Z$ to exactly and nonasymptotically control Type-I error using any test statistic in any dimensionality without assuming anything about $Y\mid (X,Z)$. This flexibility, in principle, allows one to derive powerful test statistics from complex prediction algorithms while maintaining statistical validity. Yet the direct use of such advanced test statistics in the conditional randomization test is prohibitively computationally expensive, especially with multiple testing, due to the requirement to recompute the test statistic many times on resampled data. We propose the distilled conditional randomization test, a novel approach to using state-of-the-art machine learning algorithms in the conditional randomization test while drastically reducing the number of times those algorithms need to be run, thereby taking advantage of their power and the conditional randomization test's statistical guarantees without suffering the usual computational expense. In addition to distillation, we propose a number of other tricks, such as screening and recycling computations, to further speed up the conditional randomization test without sacrificing its high power and exact validity. Indeed, we show in simulations that all our proposals combined lead to a test with similar power to the most powerful existing conditional randomization test implementations, but requiring orders of magnitude less computation, making it a practical tool even for large datasets. We demonstrate these benefits on a breast cancer dataset by identifying biomarkers related to cancer stage.
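The two ideas in the abstract — the conditional randomization test's resampling of $X\mid Z$, and distillation's trick of fitting the expensive $Y$-on-$Z$ model only once outside the resampling loop — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the linear model for $X\mid Z$, its coefficients, the data-generating process, and the simple residual-correlation statistic are all hypothetical choices made for the toy example (ordinary least squares stands in for a costly machine-learning fit).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a hypothetical model (illustration only):
# X | Z is Gaussian with a known linear mean, and Y depends on both.
n, p = 300, 3
z = rng.standard_normal((n, p))
beta = np.array([1.0, -0.5, 0.25])        # assumed known X | Z model
x = z @ beta + rng.standard_normal(n)     # X | Z ~ N(Z beta, 1)
y = z.sum(axis=1) + 0.8 * x + rng.standard_normal(n)

# Distillation: fit the (potentially expensive) Y-on-Z model ONCE,
# outside the resampling loop, and keep only its residual.
resid = y - z @ np.linalg.lstsq(z, y, rcond=None)[0]

def stat(x_draw):
    # Cheap per-resample statistic that reuses the distilled residual.
    return abs(np.corrcoef(resid, x_draw)[0, 1])

# Conditional randomization test: resample X from its known
# conditional law and rank the observed statistic among the resamples.
t_obs = stat(x)
m = 200
t_null = np.array([stat(z @ beta + rng.standard_normal(n))
                   for _ in range(m)])
p_val = (1 + np.sum(t_null >= t_obs)) / (1 + m)
print(p_val)
```

Because the costly fit happens once rather than `m` times, the loop body is only a resample of $X\mid Z$ and a cheap correlation — the source of the speedup the abstract describes. Validity needs only the $X\mid Z$ distribution: under the null, the resampled copies of $X$ are exchangeable with the observed one, so the resulting p-value is valid in finite samples.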

Last updated: 2021-07-02