Methods for correcting inference based on outcomes predicted by machine learning [Statistics]
Proceedings of the National Academy of Sciences of the United States of America (IF 11.1), Pub Date: 2020-12-01, DOI: 10.1073/pnas.2001238117
Siruo Wang 1, Tyler H McCormick 2,3, Jeffrey T Leek 4

Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complicated machine-learning models including random forests and deep neural nets. Rather than trying to derive the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that naturally fits into the standard machine-learning framework where the data are divided into training, testing, and validation sets. We train the prediction model in the training set, estimate the relationship between the observed and predicted outcomes in the testing set, and use that relationship to correct subsequent inference in the validation set. We show our postprediction inference (postpi) approach can correct bias and improve variance estimation and subsequent statistical inference with predicted outcomes. To show the broad range of applicability of our approach, we show postpi can improve inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.
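The three-stage workflow described in the abstract (train, estimate the observed-versus-predicted relationship, correct downstream inference) can be sketched roughly as follows. This is a minimal illustration on simulated data, not the postpi package's implementation or API. Several pieces are assumptions made for the sake of the example: a heavily shrunk ridge regression stands in for the "arbitrarily complicated" prediction model, the relationship between observed and predicted outcomes is taken to be linear with Gaussian noise, and the correction is a simple bootstrap that simulates outcomes from that relationship model. All function names (`ols`, `ridge_fit`, `run_demo`) are hypothetical.

```python
import numpy as np


def ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, standard errors)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = resid @ resid / (len(y) - X1.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X1.T @ X1)))
    return beta, se


def ridge_fit(X, y, lam):
    """Ridge regression on centered data; returns (intercept, coefficients)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return y_mean - x_mean @ coef, coef


def run_demo(seed=0, n=3000, n_boot=200):
    rng = np.random.default_rng(seed)
    # Simulated data (assumption): the true slope on the first covariate is 2.0.
    X = rng.normal(size=(n, 3))
    y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)
    train, test, val = np.split(rng.permutation(n), [n // 3, 2 * n // 3])

    # 1) Training set: fit the prediction model (over-shrunk on purpose,
    #    so that its predictions are attenuated).
    b0, b = ridge_fit(X[train], y[train], lam=len(train))

    def predict(Z):
        return b0 + Z @ b

    # 2) Testing set: model the relationship observed ~ predicted.
    yp_test = predict(X[test])
    gamma, _ = ols(yp_test, y[test])
    sigma_rel = (y[test] - gamma[0] - gamma[1] * yp_test).std(ddof=2)

    # 3) Validation set: only predicted outcomes are used from here on.
    x_val, yp_val = X[val, 0], predict(X[val])
    naive_beta, naive_se = ols(x_val, yp_val)  # treats predictions as if observed

    boot_beta, boot_se = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(val), size=len(val))
        # Simulate plausible observed outcomes from the relationship model.
        y_sim = gamma[0] + gamma[1] * yp_val[idx] + rng.normal(scale=sigma_rel, size=len(idx))
        beta, se = ols(x_val[idx], y_sim)
        boot_beta.append(beta[1])
        boot_se.append(se[1])

    return {
        "naive_beta": naive_beta[1],
        "naive_se": naive_se[1],
        "corrected_beta": float(np.median(boot_beta)),
        "corrected_se": float(np.median(boot_se)),
    }


result = run_demo()
print(result)
```

In this simulated setup the naive slope is pulled toward zero and its standard error is too small, because the shrunk predictions are both attenuated and less variable than real outcomes. The relationship model fitted on the testing set rescales the predictions and restores the missing noise, so the corrected estimate should land near the simulated true slope with a wider, more honest standard error. How well this carries over to real prediction models depends on whether the simple relationship model is adequate.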




Updated: 2020-12-02