A divide-and-conquer method for sparse risk prediction and evaluation.
Biostatistics (IF 2.1), Pub Date: 2020-09-10, DOI: 10.1093/biostatistics/kxaa031
Chuan Hong 1 , Yan Wang 1 , Tianxi Cai 2

Divide-and-conquer (DAC) is a commonly used strategy for overcoming the challenges of extraordinarily large data: the dataset is first broken into a series of data blocks, and the results from the individual blocks are then combined to obtain a final estimate. Various DAC algorithms have been proposed to fit a sparse predictive regression model in the $L_1$ regularization setting. However, many existing DAC algorithms remain computationally intensive when the sample size and the number of candidate predictors are both large. In addition, no existing DAC procedure provides inference for quantifying the accuracy of risk prediction models. In this article, we propose a screening and one-step linearization infused DAC (SOLID) algorithm to fit sparse logistic regression to massive datasets, by integrating the DAC strategy with a screening step and sequences of linearization. This enables us to maximize the likelihood with only the selected covariates and to perform penalized estimation via a fast approximation to the likelihood. To assess the accuracy of a predictive regression model, we develop a modified cross-validation (MCV) procedure that utilizes the side products of SOLID, substantially reducing the computational burden. Among DAC methods, the MCV procedure is the first to make inference on accuracy. Extensive simulation studies suggest that the proposed SOLID and MCV procedures substantially outperform the existing methods with respect to computational speed and achieve statistical efficiency similar to that of the full sample-based estimator. We also demonstrate that the proposed inference procedure provides valid interval estimators. We apply the proposed SOLID procedure to develop and validate a classification model for disease diagnosis using narrative clinical notes based on electronic medical record data from Partners HealthCare.
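To make the divide-and-conquer and one-step linearization ideas concrete, the following is a minimal Python sketch of a generic DAC one-step Newton update for unpenalized logistic regression. It is not the SOLID algorithm from the paper: the screening step, the $L_1$ penalty, and the MCV procedure are all omitted, and every name here (`block_score_and_hessian`, `newton_logistic`, `dac_one_step`, the simulated data) is a hypothetical choice made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def block_score_and_hessian(X, y, beta):
    """Gradient and Hessian of the logistic negative log-likelihood
    (summed over rows) for a single data block."""
    p = sigmoid(X @ beta)
    grad = X.T @ (p - y)
    hess = (X * (p * (1.0 - p))[:, None]).T @ X
    return grad, hess


def newton_logistic(X, y, n_iter=25):
    """Plain Newton iterations for logistic regression on one dataset."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad, hess = block_score_and_hessian(X, y, beta)
        beta = beta - np.linalg.solve(hess, grad)
    return beta


def dac_one_step(X, y, n_blocks=10):
    """Illustrative DAC estimator (an assumption, not the paper's method):
    fit an initial estimate on the first block, then take one Newton step
    using gradients and Hessians accumulated block by block."""
    X_blocks = np.array_split(X, n_blocks)
    y_blocks = np.array_split(y, n_blocks)
    beta_init = newton_logistic(X_blocks[0], y_blocks[0])

    dim = X.shape[1]
    grad_total = np.zeros(dim)
    hess_total = np.zeros((dim, dim))
    for Xb, yb in zip(X_blocks, y_blocks):
        # Each block only needs to return a dim-vector and a dim x dim matrix.
        g, h = block_score_and_hessian(Xb, yb, beta_init)
        grad_total += g
        hess_total += h
    return beta_init - np.linalg.solve(hess_total, grad_total)


# Simulated data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
n, dim = 20000, 5
beta_true = np.array([0.5, -1.0, 0.0, 0.8, 0.0])
X = rng.normal(size=(n, dim))
y = (rng.uniform(size=n) < sigmoid(X @ beta_true)).astype(float)

beta_dac = dac_one_step(X, y, n_blocks=10)
beta_full = newton_logistic(X, y)  # full-sample fit, for comparison
```

In this sketch each block contributes only its gradient and Hessian at the initial estimate, so the full dataset never has to be processed in a single pass; comparing `beta_dac` with `beta_full` shows how close the one-step combination comes to the full-sample fit on the simulated data.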

Updated: 2020-09-10