Multivariate survival analysis in big data: A divide-and-combine approach, Biometrics


Biometrics (IF 1.4), Pub Date: 2021-04-13, DOI: 10.1111/biom.13469
Wei Wang, Shou-En Lu, Jerry Q Cheng, Minge Xie, John B Kostis


Multivariate failure time data are frequently analyzed using marginal proportional hazards models and frailty models. When the sample size is extraordinarily large, either approach can face computational challenges. In this paper, we focus on the marginal model approach and propose a divide-and-combine method to analyze large-scale multivariate failure time data. Our method is motivated by the Myocardial Infarction Data Acquisition System (MIDAS), a New Jersey statewide database that includes 73,725,160 admissions to nonfederal hospitals and emergency rooms (ERs) from 1995 to 2017. We propose to randomly divide the full data into multiple subsets, obtain an estimator from each subset, and combine these subset estimators by a weighted method using three weights. Under mild conditions, we show that the combined estimator is asymptotically equivalent to the estimator obtained from the full data, as if the data were analyzed all at once. In addition, to screen out risk factors with weak signals, we propose to perform regularized estimation on the combined estimator using its combined confidence distribution. Theoretical properties, such as consistency, oracle properties, and asymptotic equivalence between the divide-and-combine approach and the full data approach, are studied. The performance of the proposed method is investigated using simulation studies. Our method is applied to the MIDAS data to identify risk factors related to multivariate cardiovascular-related health outcomes.
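The divide-and-combine idea in the abstract can be illustrated with a deliberately simplified sketch. This is not the authors' procedure: it swaps the marginal proportional hazards model for a one-parameter exponential model with right censoring, and it uses plain inverse-variance weights rather than the three weights studied in the paper. The function names (`fit_log_rate`, `divide_and_combine`, `simulate`) and all simulation settings are assumptions made for illustration only.

```python
import math
import random


def fit_log_rate(times, events):
    """MLE of the log hazard rate for right-censored exponential data.

    Returns (estimate, approximate variance). The variance of the
    log-rate MLE is approximately 1 / (number of observed events).
    """
    n_events = sum(events)
    total_time = sum(times)
    return math.log(n_events / total_time), 1.0 / n_events


def divide_and_combine(times, events, n_subsets, seed=0):
    """Randomly split the data, fit each subset, combine by inverse variance."""
    idx = list(range(len(times)))
    random.Random(seed).shuffle(idx)  # random division into subsets
    fits = []
    for k in range(n_subsets):
        part = idx[k::n_subsets]
        fits.append(
            fit_log_rate([times[i] for i in part], [events[i] for i in part])
        )
    # Inverse-variance weights (an assumption; the paper studies its own weights).
    weights = [1.0 / var for _, var in fits]
    combined = sum(w * est for w, (est, _) in zip(weights, fits)) / sum(weights)
    return combined, 1.0 / sum(weights)


def simulate(n, rate, censor_time, seed=1):
    """Exponential failure times with administrative censoring (toy data)."""
    rng = random.Random(seed)
    times, events = [], []
    for _ in range(n):
        t = rng.expovariate(rate)
        times.append(min(t, censor_time))
        events.append(1 if t <= censor_time else 0)
    return times, events


times, events = simulate(n=20000, rate=0.5, censor_time=3.0)
full_est, full_var = fit_log_rate(times, events)
dac_est, dac_var = divide_and_combine(times, events, n_subsets=20)
print("full data:          ", full_est, full_var)
print("divide-and-combine: ", dac_est, dac_var)
```

In this toy setting the combined estimate should sit close to the full-data estimate, which mirrors the asymptotic equivalence claimed in the abstract. Each subset fit touches only a fraction of the data, so the fits could be run in sequence or in parallel without loading the full data at once.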


Updated: 2021-04-13
