Feature screening with large-scale and high-dimensional survival data, Biometrics



Feature screening with large-scale and high-dimensional survival data
Biometrics (IF 1.4), Pub Date: 2021-04-21, DOI: 10.1111/biom.13479
Grace Y Yi¹, Wenqing He², Raymond J Carroll³,⁴


Very large data sets present great challenges in modeling, inference, and computation. In handling big data, much attention has been directed to settings with “large p, small n”, where p represents the number of variables and n stands for the sample size. Relatively less work has addressed problems in which p and n are both large, although such data have now become more accessible than before. A large volume of data does not automatically ensure good-quality inference, because a large number of unimportant variables may be collected in the process of gathering informative variables. To carry out valid statistical analysis, it is imperative to screen out noisy variables that have no predictive value for explaining the outcome variable. In this paper, we develop a screening method for handling large-sized survival data, where the sample size n is large and the dimension p of covariates is of non-polynomial order of the sample size n, the so-called NP-dimension. We rigorously establish theoretical results for the proposed method and conduct numerical studies to assess its performance. Our research offers multiple extensions of existing work and enlarges the scope of high-dimensional data analysis. The proposed method capitalizes on the connections among useful regression settings and offers a computationally efficient screening procedure. Our method can be applied to different situations with large-scale data, including genomic data.


Updated: 2021-04-21
