Improving the value of public RNA-seq expression data by phenotype prediction
Nucleic Acids Research (IF 14.9), Pub Date: 2018-03-05, DOI: 10.1093/nar/gky102
Shannon E Ellis 1, 2 , Leonardo Collado-Torres 2, 3 , Andrew Jaffe 1, 2, 3, 4 , Jeffrey T Leek 1, 2

Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements, using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70,000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package, and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data are available for use on a scale that was not previously feasible.
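
The general idea of training on well-annotated samples and predicting labels for unannotated ones can be illustrated with a minimal sketch. It assumes a simple nearest-centroid classifier and made-up toy expression values for two hypothetical features; it is not the phenopredict implementation, and the function names `fit_centroids` and `predict` are illustrative only.

```python
from statistics import mean


def fit_centroids(samples, labels):
    """Average the expression vectors of each labeled group (training step)."""
    groups = {}
    for vec, lab in zip(samples, labels):
        groups.setdefault(lab, []).append(vec)
    # One centroid per label: the per-feature mean across that label's samples.
    return {lab: [mean(col) for col in zip(*vecs)] for lab, vecs in groups.items()}


def predict(centroids, vec):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], vec))


# Toy, invented values standing in for expression of two hypothetical features
# in well-annotated training samples (the role TCGA/GTEx play in the abstract).
train = [[8.1, 0.2], [7.9, 0.1], [0.3, 9.0], [0.1, 8.7]]
labels = ["male", "male", "female", "female"]

centroids = fit_centroids(train, labels)

# Samples with missing annotation receive a predicted phenotype.
unlabeled = [[7.5, 0.4], [0.2, 8.9]]
predicted = [predict(centroids, v) for v in unlabeled]
print(predicted)
```

In practice a predictor of this kind would be evaluated on held-out annotated samples before its output is used in downstream analyses, as the abstract describes for sex, tissue, sample source, and sequencing strategy.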

Updated: 2018-03-05