Relaxing the assumptions of knockoffs by conditioning, Annals of Statistics



Relaxing the assumptions of knockoffs by conditioning
Annals of Statistics (IF 3.2), Pub Date: 2020-10-01, DOI: 10.1214/19-aos1920
Dongming Huang, Lucas Janson

The recent paper Candes et al. (2018) introduced model-X knockoffs, a method for variable selection that provably and non-asymptotically controls the false discovery rate with no restrictions or assumptions on the dimensionality of the data or the conditional distribution of the response given the covariates. The one requirement for the procedure is that the covariate samples are drawn independently and identically from a precisely-known (but arbitrary) distribution. The present paper shows that the exact same guarantees can be made without knowing the covariate distribution fully, but instead knowing it only up to a parametric model with as many as $\Omega(n^{*}p)$ parameters, where $p$ is the dimension and $n^{*}$ is the number of covariate samples (which may exceed the usual sample size $n$ of labeled samples when unlabeled samples are also available). The key is to treat the covariates as if they are drawn conditionally on their observed value for a sufficient statistic of the model. Although this idea is simple, even in Gaussian models conditioning on a sufficient statistic leads to a distribution supported on a set of zero Lebesgue measure, requiring techniques from topological measure theory to establish valid algorithms. We demonstrate how to do this for three models of interest, with simulations showing the new approach remains powerful under the weaker assumptions.
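As a toy illustration of the conditioning idea only (not one of the constructions in the paper), consider $n$ i.i.d. draws from $N(\mu, \sigma^2)$ with $\sigma$ known and $\mu$ unknown. The sample mean is a sufficient statistic for $\mu$, so the conditional law of the sample given its mean does not depend on $\mu$ and can be sampled exactly. The function name `resample_given_mean` and the one-dimensional setup are assumptions made for this sketch.

```python
import random
from statistics import fmean


def resample_given_mean(x, sigma, rng):
    """Draw a fresh sample from the law of x conditional on its sample mean.

    Toy setting (an assumption of this sketch): x_1..x_n are i.i.d.
    N(mu, sigma^2) with sigma known and mu unknown. Given the sample mean m,
    the sample is distributed as N(m * 1, sigma^2 * (I - 11^T / n)), which
    does not involve mu.
    """
    n = len(x)
    target = fmean(x)
    # Fresh Gaussian noise with the known scale; mu is never needed.
    z = [rng.gauss(0.0, sigma) for _ in range(n)]
    z_bar = fmean(z)
    # Centering the noise and shifting by the observed mean yields a draw
    # with exactly the same sufficient statistic as the data.
    return [target + zi - z_bar for zi in z]


rng = random.Random(0)
x = [rng.gauss(3.0, 1.0) for _ in range(5)]
x_new = resample_given_mean(x, sigma=1.0, rng=rng)
```

Every resample lies on the hyperplane where the coordinates sum to $n$ times the observed mean, a set of zero Lebesgue measure in $\mathbb{R}^n$. This is the simplest instance of the degeneracy the abstract mentions.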


Updated: 2020-10-01
