A Fast and Scalable Framework for Large-scale and Ultrahigh-dimensional Sparse Regression with Application to the UK Biobank
bioRxiv - Bioinformatics Pub Date : 2020-05-31 , DOI: 10.1101/630079
Junyang Qian , Yosuke Tanigawa , Wenfei Du , Matthew Aguirre , Chris Chang , Robert Tibshirani , Manuel A. Rivas , Trevor Hastie

The UK Biobank (Bycroft et al., 2018) is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with GWAS, have already been shown to greatly improve prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso (Tibshirani, 1996) has, since it was first proposed in statistics, proven to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension of the UK Biobank pose new challenges for applying the lasso, as many existing algorithms and their implementations do not scale to applications of this size. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including datasets that are larger than available memory. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet (Friedman et al., 2010a) and is optimized for single nucleotide polymorphism (SNP) datasets. It currently supports the l1-penalized linear model, logistic regression, and Cox model, and also extends to the elastic net with an l1/l2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve superior predictive performance on quantitative and qualitative traits including height, body mass index, asthma and high cholesterol.
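
To make the idea of wrapping an existing lasso solver in a batch screening loop more concrete, here is a minimal sketch for a single penalty value. It is an illustration under stated assumptions, not the BASIL algorithm as specified in the paper and not the snpnet API. The names `lasso_cd`, `batch_screen_lasso`, `batch`, and `tol` are hypothetical, a plain coordinate-descent routine stands in for "any existing lasso solver", and the screening score (absolute inner product of each variable with the current residual) is an assumed choice.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_threshold(z, t):
    """Soft-thresholding operator used in lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(cols, y, lam, n_iter=500):
    """Stand-in solver (assumption): coordinate descent minimising
    (1/2n)||y - Xb||^2 + lam * ||b||_1, with X given as a list of columns."""
    n = len(y)
    norms = [dot(x, x) / n for x in cols]
    beta = [0.0] * len(cols)
    resid = list(y)
    for _ in range(n_iter):
        max_change = 0.0
        for j, x in enumerate(cols):
            if norms[j] == 0.0:
                continue
            rho = dot(x, resid) / n + norms[j] * beta[j]
            delta = soft_threshold(rho, lam) / norms[j] - beta[j]
            if delta != 0.0:
                resid = [r - delta * xi for r, xi in zip(resid, x)]
                beta[j] += delta
                max_change = max(max_change, abs(delta))
        if max_change < 1e-10:
            break
    return beta, resid

def batch_screen_lasso(cols, y, lam, batch=5, tol=1e-8):
    """Hypothetical screen / fit / check loop for one value of lam."""
    n, p = len(y), len(cols)
    active, beta_active, resid = [], [], list(y)
    while True:
        # One pass over all variables: score each by |x_j' r| / n.
        scores = [abs(dot(cols[j], resid)) / n for j in range(p)]
        # An inactive variable scoring above lam violates the optimality
        # (KKT) condition for a zero coefficient.
        violators = [j for j in range(p)
                     if j not in active and scores[j] > lam + tol]
        if not violators:
            break
        # Add the strongest violators as a batch and refit on the small set
        # (refit from zero here; a warm start would be a natural refinement).
        violators.sort(key=lambda j: -scores[j])
        active.extend(violators[:batch])
        beta_active, resid = lasso_cd([cols[j] for j in active], y, lam)
    beta = [0.0] * p
    for j, b in zip(active, beta_active):
        beta[j] = b
    return beta, active

# Synthetic demo data (not UK Biobank data): two true signals among 30 columns.
random.seed(0)
n, p = 60, 30
cols = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2.0 * cols[0][i] - 1.5 * cols[3][i] + 0.1 * random.gauss(0, 1)
     for i in range(n)]
beta, active = batch_screen_lasso(cols, y, lam=0.3)
print("variables ever fitted:", sorted(active))
print("nonzero coefficients:", [j for j, b in enumerate(beta) if b != 0.0])
```

In this sketch the solver only ever sees the columns in `active`, while the full data is touched only in the scoring pass, which could in principle be streamed from disk in chunks. That separation is what would let a loop of this kind handle data larger than memory; how snpnet actually organizes these steps is not described in the abstract.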

Updated: 2020-05-31