Large-scale benchmark study of survival prediction methods using multi-omics data.
Briefings in Bioinformatics (IF 6.8), Pub Date: 2020-08-22, DOI: 10.1093/bib/bbaa167
Moritz Herrmann, Philipp Probst, Roman Hornung, Vindi Jurinovic, Anne-Laure Boulesteix

Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate for deriving such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied to 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database ‘The Cancer Genome Atlas’ (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and methods that do not take the group structure of the omics variables into account. The Kaplan–Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno’s C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking the multi-omics structure into account have slightly better prediction performance. Taking this structure into account can keep the predictive information in low-dimensional groups, especially clinical variables, from going unexploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited.
Contact: moritz.herrmann@stat.uni-muenchen.de, +49 89 2180 3198
Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
All analyses are reproducible using R code freely available on GitHub.
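
The evaluation scheme described in the abstract (repeated 5-fold cross-validation scored with a concordance index on censored survival times) can be illustrated with a minimal sketch. This is not the authors' R code. It uses Harrell's C-index, a simpler relative of the Uno's C-index used in the study, because it needs no inverse-probability-of-censoring weights. The function names `harrell_c_index` and `repeated_kfold` and the toy data are assumptions made for illustration only.

```python
import random


def harrell_c_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly.

    A pair (i, j) is comparable when subject i has the shorter time and
    an observed event (events[i] is truthy). The pair is concordant when
    the subject with the shorter time has the higher risk score; ties in
    risk count as 0.5. Note: this is NOT Uno's C-index (no censoring weights).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def repeated_kfold(n, k=5, repeats=10, seed=0):
    """Yield (train, test) index lists for `repeats` shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for fold in range(k):
            test = idx[fold::k]  # every k-th shuffled index forms one fold
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test


# Toy data (hypothetical): survival times, event indicators (1 = event
# observed, 0 = censored) and risk scores from some fitted model.
times = [2, 4, 6, 8, 10]
events = [1, 0, 1, 1, 0]
risks = [0.9, 0.3, 0.1, 0.2, 0.4]
c_index = harrell_c_index(times, events, risks)
```

In a benchmark of this kind, each method would be fitted on every `train` split and its risk scores evaluated on the matching `test` split, with the metric averaged over all folds and repetitions. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.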

Updated: 2020-08-22