An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat
Machine Learning (IF 7.5), Pub Date: 2019-10-23, DOI: 10.1007/s10994-019-05848-5
Nastasiya F Grinberg, Oghenejokpeme I Orhobor, Ross D King

In phenotype prediction, the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance because they are central to medicine, crop breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and two complex and real-world (rice and wheat). We compared standard machine learning methods (elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM)) with two state-of-the-art classical statistical genetics methods: genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression and then the two statistical genetics methods. With greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forest. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when it is present. We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which methods are likely to perform well on any given problem is elusive and non-trivial.
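To make the setting concrete, the following is a minimal sketch of one of the compared method families, lasso regression, applied to phenotype prediction from genetic markers. It is not the authors' pipeline or data. The markers are simulated binary values, the phenotype is assumed to be purely additive in three arbitrarily chosen markers, and the noise level, penalty `lam`, sample sizes, and all function names are illustrative assumptions. The lasso is fitted with plain cyclic coordinate descent in standard-library Python, and accuracy is reported as R² on held-out individuals.

```python
import random


def simulate(n, p, effects, noise_sd, rng):
    """Simulate n individuals with p binary markers and an additive phenotype.

    `effects` maps marker index -> effect size (an assumption of this sketch).
    """
    X = [[float(rng.randint(0, 1)) for _ in range(p)] for _ in range(n)]
    y = [sum(b * row[j] for j, b in effects.items()) + rng.gauss(0.0, noise_sd)
         for row in X]
    return X, y


def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_lasso(X, y, lam, n_sweeps=50):
    """Minimise (1/2n)*||y - b0 - X b||^2 + lam*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centre markers and phenotype so the intercept can be recovered afterwards.
    Xc = [[row[j] - col_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a marker with no variation carries no information
            # Correlation of marker j with the residual that excludes marker j.
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new_beta
    intercept = y_mean - sum(b * m for b, m in zip(beta, col_mean))
    return intercept, beta


def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)
effects = {3: 2.0, 17: -1.5, 40: 1.0}  # hypothetical causal markers
X_train, y_train = simulate(200, 50, effects, 0.5, rng)
X_test, y_test = simulate(100, 50, effects, 0.5, rng)

intercept, beta = fit_lasso(X_train, y_train, lam=0.05)
y_pred = [intercept + sum(b * x for b, x in zip(beta, row)) for row in X_test]
r2 = r_squared(y_test, y_pred)
selected = [j for j, b in enumerate(beta) if b != 0.0]

print("held-out R^2:", round(r2, 3))
print("markers with non-zero coefficients:", selected)
```

Raising `noise_sd` or shrinking the training set in this sketch mirrors, in a very simplified way, the kind of noise and sample-size experiments the abstract describes for the yeast data. It says nothing about how the other methods would compare.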

Updated: 2019-10-23