Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data
Biology Direct (IF 5.7), Pub Date: 2020-12-10, DOI: 10.1186/s13062-020-00287-y
Julie Chih-Yu Chen, Andrea D Tyler

The advent of metagenomic sequencing provides microbial abundance patterns that can be leveraged for sample origin prediction. Supervised machine learning classification approaches have been reported to predict sample origin accurately when the origin has been previously sampled. Using metagenomic datasets provided by the 2019 CAMDA challenge, we evaluated the influence of varying technical, analytical and machine learning approaches on result interpretation and novel source prediction. Comparison between 16S rRNA amplicon and shotgun sequencing approaches, as well as between metagenomic analytical tools, showed differences in normalized microbial abundance, especially for organisms present at low abundance. Shotgun sequence data analyzed with Kraken2 and Bracken for taxonomic annotation had higher detection sensitivity. Because classification models can only assign labels for origins they were trained on, we took an alternative approach, using Lasso-regularized multivariate regression to predict geographic coordinates for comparison. In both models, prediction errors were much higher in leave-1-city-out than in 10-fold cross-validation; the former gives the more realistic picture of how much harder it is to accurately predict samples from new origins. This challenge was further confirmed when the model was applied to a set of samples obtained from new origins. Overall, the prediction performance of the regression and classification models, as measured by mean squared error, was comparable on mystery samples. Because prediction error rates are higher for samples from new origins, we provide an additional strategy, based on prediction ambiguity, to infer whether a sample is from a new origin. Lastly, we report increased prediction error when data from different sequencing protocols were included as training data.
Herein, we highlight the capacity to predict sample origin accurately for pre-trained origins and the challenge of predicting new origins with both regression and classification models. Overall, this work summarizes the impact of sequencing technique, protocol, taxonomic analytical approach, and machine learning approach on the use of metagenomics for prediction of sample origin.
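
The contrast between leave-1-city-out and k-fold cross-validation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy city labels, and the interleaved (non-random) fold assignment are all assumptions made for illustration. It only shows how the two schemes partition samples, not how the models are fit.

```python
from collections import defaultdict


def leave_one_city_out(cities):
    """Yield (held_out_city, train_idx, test_idx).

    Each split holds out every sample from one city, so the model is
    always tested on an origin it has never seen during training.
    """
    by_city = defaultdict(list)
    for idx, city in enumerate(cities):
        by_city[city].append(idx)
    for city in sorted(by_city):
        test_idx = by_city[city]
        held_out = set(test_idx)
        train_idx = [i for i in range(len(cities)) if i not in held_out]
        yield city, train_idx, test_idx


def k_fold(n_samples, k):
    """Yield (train_idx, test_idx) for k interleaved folds.

    City labels are ignored, so samples from a test city usually also
    appear in training. Interleaving is a simplification (assumption);
    real k-fold splits are typically shuffled.
    """
    for fold in range(k):
        test_idx = [i for i in range(n_samples) if i % k == fold]
        train_idx = [i for i in range(n_samples) if i % k != fold]
        yield train_idx, test_idx


# Hypothetical city label per sample (not taken from the CAMDA data).
cities = ["A", "A", "B", "B", "B", "C", "C"]

loco_splits = list(leave_one_city_out(cities))
kfold_splits = list(k_fold(len(cities), k=3))

# Leave-1-city-out: the held-out city never appears in the training set.
for city, train_idx, test_idx in loco_splits:
    assert city not in {cities[i] for i in train_idx}
```

With these toy labels, every k-fold test sample comes from a city that is also represented in training, whereas each leave-1-city-out split tests on a city that was withheld entirely. That difference is why the second scheme is the closer proxy for predicting samples from new origins.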

Updated: 2020-12-11