Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses
BMC Bioinformatics (IF 3), Pub Date: 2020-10-01, DOI: 10.1186/s12859-020-03755-4
Elisabetta Manduchi 1, 2 , Weixuan Fu 2 , Joseph D Romano 1 , Stefano Ruberto 1 , Jason H Moore 1, 2

A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis.

We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids ‘leakage’ during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj.

In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.
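The idea of regressing out covariates without leakage can be illustrated with a minimal sketch. This is not the TPOT extension's code or API: it uses plain NumPy, an ordinary least-squares adjustment, and hypothetical helper names (`fit_adjuster`, `apply_adjuster`, `kfold_indices`, `adjusted_folds`) chosen only for illustration. The point it shows is that, within each cross-validation fold, the covariate regression is fitted on the training rows alone and then applied unchanged to the held-out rows.

```python
import numpy as np


def fit_adjuster(covariates, values):
    """Least-squares fit of values ~ intercept + covariates; returns coefficients."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return coef


def apply_adjuster(coef, covariates, values):
    """Subtract the covariate-predicted part of values, leaving residuals."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    return values - design @ coef


def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx


def adjusted_folds(X, C, n_splits=5, seed=0):
    """Yield covariate-adjusted train/validation features for each fold.

    The adjustment is fitted on the training rows only, so no information
    from the validation rows leaks into the fitted coefficients.
    """
    for train_idx, val_idx in kfold_indices(len(X), n_splits, seed):
        coef = fit_adjuster(C[train_idx], X[train_idx])  # training rows only
        X_train_adj = apply_adjuster(coef, C[train_idx], X[train_idx])
        X_val_adj = apply_adjuster(coef, C[val_idx], X[val_idx])
        yield train_idx, val_idx, X_train_adj, X_val_adj
```

The same pattern would apply to the target when covariates affect it as well. Fitting the adjustment once on the full data set before splitting would be the leaky variant that this sketch is meant to contrast with.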

Updated: 2020-10-02