A machine learning method for selection of genetic variants to increase prediction accuracy of type 2 diabetes mellitus using sequencing data
Statistical Analysis and Data Mining (IF 2.1), Pub Date: 2020-04-04, DOI: 10.1002/sam.11456
Luann C. Jung 1, Haiyan Wang 2, Xukun Li 2, Cen Wu 2

Type 2 diabetes mellitus (T2DM) affects millions of people through its life‐altering complications. Worldwide, 3.4 million people die of diabetes annually. Studying the effect of genetic polymorphism on T2DM has been plagued by the available sample size. A 2016 Nature Reviews article summarized that the accuracy of predicting future type 2 diabetes from genetic polymorphism is very low at the population level. Innumerable associations between genes, environmental factors, and type 2 diabetes remain to be discovered. This research presents a method to identify subtle effects of genetic variants using whole genome sequencing data and improve prediction accuracy of T2DM at the population level. To achieve this, a new feature selection procedure and a classifier are proposed. The method involves (a) first applying sparse principal component analysis to genotype data to obtain orthogonal features; (b) building a new classifier using single nucleotide polymorphism (SNP)‐specific regularization parameters to reduce the false positive rate of feature selection; (c) verifying feature relevance through penalized logistic regression. After application to a dataset containing 625 597 SNPs and 23 environmental variables from each of 3326 humans, the method identified 271 genetic variants with subtle effects on T2DM prediction. These variants led to greatly improved prediction accuracy for new patients at the population level. The proposed method also has the advantage of computational efficiency, over 15 times faster than random forest and extreme gradient boosting (XGBoost) classifiers, and thus provides a promising tool for large‐scale genome‐wide association studies.
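The idea in step (b), giving each SNP its own regularization parameter, can be illustrated with a small sketch. The code below is not the authors' classifier: it is a plain L1-penalized logistic regression with one penalty weight per feature, fitted by proximal gradient descent. The function names, the toy genotype data, and the penalty values are all hypothetical and chosen only for illustration.

```python
import math


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink `value` toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_weighted_l1_logistic(X, y, penalties, step=0.1, n_iter=500):
    """Fit logistic regression with a separate L1 penalty per feature.

    Minimizes mean log-loss + sum_j penalties[j] * |beta[j]| by proximal
    gradient descent; the intercept is left unpenalized.
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for row, label in zip(X, y):
            z = intercept + sum(b * x for b, x in zip(beta, row))
            resid = 1.0 / (1.0 + math.exp(-z)) - label
            grad0 += resid / n
            for j in range(p):
                grad[j] += resid * row[j] / n
        # Gradient step on the intercept, gradient + shrinkage step on beta.
        intercept -= step * grad0
        beta = [soft_threshold(beta[j] - step * grad[j], step * penalties[j])
                for j in range(p)]
    return intercept, beta


# Toy genotypes coded 0/1/2 (hypothetical data, not from the paper).
# Column 0 drives the outcome; column 1 is essentially unrelated to it.
X, y = [], []
for i in range(60):
    g0, g1 = i % 3, (i // 3) % 3
    X.append([g0, g1])
    if g0 == 1:
        y.append((i // 3) % 2)
    else:
        y.append(1 if g0 == 2 else 0)

# A light penalty on column 0 and a heavy one on column 1 (assumed values).
intercept, beta = fit_weighted_l1_logistic(X, y, penalties=[0.01, 0.5])
print(beta)
```

With these assumed penalties, the coefficient of the informative column stays positive while the heavily penalized column is shrunk exactly to zero, which is how feature-specific penalties can suppress false positives during selection. How the paper actually chooses its SNP-specific parameters is not described in the abstract.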

Updated: 2020-04-04