Assessing reproducibility and veracity across machine learning techniques in biomedicine: A case study using TCGA data.
International Journal of Medical Informatics (IF 4.9), Pub Date: 2020-05-13, DOI: 10.1016/j.ijmedinf.2020.104148
Ahyoung Amy Kim, Samir Rachid Zaim, Vignesh Subbian

Background

Many studies that aim to identify gene biomarkers using statistical methods and translate them into FDA-approved drugs have faced challenges due to a lack of clinical validity and methodological reproducibility. Because genomic data analysis now relies on these statistical learning tools more heavily than before, it is vital to address the limitations of these computational techniques.

Methods

Our study demonstrates these methodological gaps among the most common statistical learning techniques used in gene expression analysis. To assess the classification ability and reproducibility of statistical learning tools for gene biomarker detection, six state-of-the-art machine learning models were trained on four different cancer datasets retrieved from The Cancer Genome Atlas (TCGA). Standard performance metrics, including specificity, sensitivity, precision, and F1 score, were evaluated to investigate classification ability. To analyze reproducibility, the identifiability of gene classifiers was examined by quantifying the consistency of the chosen classifier genes across methods.
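
As an illustration of the four performance metrics named above, the following is a minimal sketch that computes them from binary labels. The function name `classification_metrics`, the label encoding (1 = positive class, 0 = negative class), and the toy labels are assumptions made for this example; they are not taken from the study.

```python
def classification_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, precision and F1 for binary labels.

    Assumes labels are encoded as 1 (positive) and 0 (negative).
    Ratios with a zero denominator are reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall on the positive class
    specificity = ratio(tn, tn + fp)   # recall on the negative class
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Toy labels, purely illustrative (not TCGA data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Reporting specificity alongside sensitivity matters when the classes are imbalanced, since a model can score well on one while failing on the other.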

Results

Among the six state-of-the-art machine learning methods, the random forest had the best classification ability overall. Very few genes were selected by multiple methods, which suggests poor identifiability and reproducibility of statistical learning methods for gene expression data. Our results demonstrated the challenges of reproducing discoveries from gene expression analysis due to the inherent differences that exist in statistical machine learning methods.
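
One way to quantify how consistently different methods choose the same genes is sketched below. The abstract does not state which consistency measure was used, so the Jaccard index and the "selected by at least k methods" count here are assumptions, as are the function names, method names, and placeholder gene identifiers.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(selections):
    """Jaccard index for every pair of methods, keyed by sorted method names."""
    return {
        (m1, m2): jaccard(selections[m1], selections[m2])
        for m1, m2 in combinations(sorted(selections), 2)
    }


def genes_selected_by_at_least(selections, k):
    """Genes chosen by at least k of the methods."""
    counts = Counter(g for genes in selections.values() for g in set(genes))
    return {g for g, c in counts.items() if c >= k}


# Hypothetical selections with placeholder gene names (not results from the study).
selections = {
    "random_forest": ["GENE_A", "GENE_B", "GENE_C"],
    "lasso": ["GENE_B", "GENE_D"],
    "svm": ["GENE_E"],
}
overlap = pairwise_jaccard(selections)
shared = genes_selected_by_at_least(selections, 2)
```

Low pairwise values and a small shared set would correspond to the poor identifiability described above: each method performs its own selection, and few genes are picked by more than one method.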

Conclusion

Because statistical machine learning models can vary widely in high-dimensional settings such as the analysis of gene expression data, transparent analysis procedures, including data preprocessing, model parameterization, and the evaluation and choice of interpretable models, are required for clinical validity and utility.



Updated: 2020-05-13