The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models.
Biostatistics ( IF 1.8 ) Pub Date : 2018-09-06 , DOI: 10.1093/biostatistics/kxy044
Yuqing Zhang 1 , Christoph Bernau 2 , Giovanni Parmigiani 3, 4 , Levi Waldron 5

Cross-study validation (CSV) of prediction models is an alternative to traditional cross-validation (CV) in domains where multiple comparable datasets are available. Although many studies have noted potential sources of heterogeneity in genomic studies, to our knowledge none have systematically investigated their intertwined impacts on prediction accuracy across studies. We employ a hybrid parametric/non-parametric bootstrap method to realistically simulate publicly available compendia of microarray, RNA-seq, and whole metagenome shotgun microbiome studies of health outcomes. Three types of heterogeneity between studies are manipulated and studied: (i) imbalances in the prevalence of clinical and pathological covariates, (ii) differences in gene covariance that could be caused by batch, platform, or tumor purity effects, and (iii) differences in the "true" model that associates gene expression and clinical factors to outcome. We assess model accuracy while altering these factors. Lower accuracy is seen in CSV than in CV. Surprisingly, heterogeneity in known clinical covariates and differences in gene covariance structure contribute very little to the loss of accuracy when validating in new studies. However, forcing identical generative models greatly reduces the within/across study difference. These results, observed consistently for multiple disease outcomes and omics platforms, suggest that the most easily identifiable sources of study heterogeneity are not necessarily the primary ones that undermine the ability to replicate the accuracy of omics prediction models in new studies. Unidentified heterogeneity, such as that arising from unmeasured confounding, may be more important.
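
The contrast between within-study CV and CSV can be illustrated with a small simulation. The sketch below is not the authors' bootstrap procedure: the Gaussian features, the nearest-centroid classifier, the coefficient vectors, and all function names (`simulate_study`, `cross_study_matrix`, `demo`) are assumptions chosen only for illustration. It builds a matrix whose diagonal holds k-fold CV accuracy within each study and whose off-diagonal entries hold the accuracy of a model trained on one study and validated on another, then compares the diagonal/off-diagonal gap when the "true" models are identical versus different.

```python
import random
from statistics import mean


def simulate_study(n, beta, rng):
    """Hypothetical generator: N(0,1) features, label = sign of x.beta + noise."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in beta]
        score = sum(b * v for b, v in zip(beta, x)) + rng.gauss(0, 0.5)
        X.append(x)
        y.append(1 if score > 0 else 0)
    return X, y


def fit_centroids(X, y):
    """Nearest-centroid classifier: store the per-class feature means."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [mean(col) for col in zip(*rows)]
    return cents


def predict(cents, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(cents, key=lambda label: dist(cents[label]))


def accuracy(cents, X, y):
    return mean(1.0 if predict(cents, x) == t else 0.0 for x, t in zip(X, y))


def within_study_cv(X, y, k=5):
    """Plain k-fold CV accuracy inside a single study."""
    n = len(X)
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        Xtr = [x for i, x in enumerate(X) if i not in test_idx]
        ytr = [t for i, t in enumerate(y) if i not in test_idx]
        Xte = [x for i, x in enumerate(X) if i in test_idx]
        yte = [t for i, t in enumerate(y) if i in test_idx]
        scores.append(accuracy(fit_centroids(Xtr, ytr), Xte, yte))
    return mean(scores)


def cross_study_matrix(studies, k=5):
    """Z[i][j]: train on study i, validate on study j; diagonal uses k-fold CV."""
    m = len(studies)
    Z = [[0.0] * m for _ in range(m)]
    for i, (Xi, yi) in enumerate(studies):
        model = fit_centroids(Xi, yi)
        for j, (Xj, yj) in enumerate(studies):
            if i == j:
                Z[i][j] = within_study_cv(Xi, yi, k)
            else:
                Z[i][j] = accuracy(model, Xj, yj)
    return Z


def mean_gap(Z):
    """Mean within-study (diagonal) accuracy minus mean cross-study accuracy."""
    m = len(Z)
    diag = mean(Z[i][i] for i in range(m))
    off = mean(Z[i][j] for i in range(m) for j in range(m) if i != j)
    return diag - off


def demo(seed=0, n=300):
    """Return (gap with identical true models, gap with different true models)."""
    rng = random.Random(seed)
    same = [[1.0, 1.0, 0.0, 0.0]] * 3
    diff = [[1.0, 1.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0], [-1.0, 1.0, 0.0, 0.0]]
    gaps = []
    for betas in (same, diff):
        studies = [simulate_study(n, b, rng) for b in betas]
        gaps.append(mean_gap(cross_study_matrix(studies)))
    return tuple(gaps)


if __name__ == "__main__":
    gap_same, gap_diff = demo()
    print(f"CV minus CSV gap, identical models: {gap_same:.3f}")
    print(f"CV minus CSV gap, different models: {gap_diff:.3f}")
```

In this toy setting only the generative model varies between studies, so it mirrors heterogeneity type (iii) alone; covariate imbalance and covariance differences would need their own simulation knobs.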

Updated: 2020-04-17