A comparative evaluation of the generalised predictive ability of eight machine learning algorithms across ten clinical metabolomics data sets for binary classification.
Metabolomics (IF 3.5), Pub Date: 2019-11-15, DOI: 10.1007/s11306-019-1612-4
Kevin M Mendez, Stacey N Reinke, David I Broadhurst

INTRODUCTION: Metabolomics is increasingly being used in the clinical setting for disease diagnosis, prognosis and risk prediction. Machine learning algorithms are particularly important in the construction of multivariate metabolite prediction. Historically, partial least squares (PLS) regression has been the gold standard for binary classification. Nonlinear machine learning methods such as random forests (RF), kernel support vector machines (SVM) and artificial neural networks (ANN) may be more suited to modelling possible nonlinear metabolite covariance, and thus provide better predictive models.

OBJECTIVES: We hypothesise that for binary classification using metabolomics data, nonlinear machine learning methods will provide superior generalised predictive ability when compared to linear alternatives, in particular when compared with the current gold standard PLS discriminant analysis.

METHODS: We compared the general predictive performance of eight archetypal machine learning algorithms across ten publicly available clinical metabolomics data sets. The algorithms were implemented in the Python programming language. All code and results have been made publicly available as Jupyter notebooks.

RESULTS: There was only marginal improvement in predictive ability for SVM and ANN over PLS across all data sets. RF performance was comparatively poor. The use of out-of-bag bootstrap confidence intervals provided a measure of uncertainty of model prediction such that the quality of metabolomics data was observed to be a bigger influence on generalised performance than model choice.

CONCLUSION: The size of the data set, and choice of performance metric, had a greater influence on generalised predictive performance than the choice of machine learning algorithm.
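
The out-of-bag bootstrap confidence intervals mentioned in the results can be illustrated with a minimal sketch. This is not the authors' code (their Jupyter notebooks are the reference). It rests on several assumptions: the performance metric is the area under the ROC curve, the interval is a plain 95% percentile interval, the data are synthetic, and a nearest-centroid rule stands in for the classifier (it is not one of the eight algorithms compared). The function names `auc`, `fit_centroids`, `score` and `oob_bootstrap_auc` are hypothetical.

```python
import random
from statistics import mean


def auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A positive/negative pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroids(X, y):
    """Stand-in classifier (an assumption): per-class mean of each feature."""
    centroids = {}
    for k in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == k]
        centroids[k] = [mean(r[j] for r in rows) for j in range(len(X[0]))]
    return centroids


def score(centroids, x):
    """Higher score means the sample is closer to the class-1 centroid."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, centroids[0]))
    d1 = sum((a - b) ** 2 for a, b in zip(x, centroids[1]))
    return d0 - d1


def oob_bootstrap_auc(X, y, n_boot=200, seed=0):
    """Train on each bootstrap resample, evaluate on the samples left out."""
    rng = random.Random(seed)
    n = len(y)
    aucs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]      # sample with replacement
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]  # out-of-bag samples
        y_in = [y[i] for i in idx]
        y_oob = [y[i] for i in oob]
        # Skip resamples where either set is missing one of the two classes.
        if len(set(y_in)) < 2 or len(set(y_oob)) < 2:
            continue
        model = fit_centroids([X[i] for i in idx], y_in)
        aucs.append(auc([score(model, X[i]) for i in oob], y_oob))
    aucs.sort()
    lower = aucs[int(0.025 * (len(aucs) - 1))]  # approximate 2.5th percentile
    upper = aucs[int(0.975 * (len(aucs) - 1))]  # approximate 97.5th percentile
    return mean(aucs), lower, upper


# Synthetic two-class data standing in for a metabolite table (an assumption).
data_rng = random.Random(1)
X = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(40)]
X += [[data_rng.gauss(1.5, 1.0), data_rng.gauss(1.5, 1.0)] for _ in range(40)]
y = [0] * 40 + [1] * 40

mean_auc, lower, upper = oob_bootstrap_auc(X, y)
print(f"OOB AUC = {mean_auc:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

A wide interval from such a procedure signals that the estimate depends heavily on which samples were drawn. That is consistent with the abstract's point that data set size and quality can matter more than the choice of algorithm.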

Updated: 2019-11-15