Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci.
Genome Research, Pub Date: 2019-09-19, DOI: 10.1101/gr.246462.118
Jonathan M Mudge 1 , Irwin Jungreis 2, 3 , Toby Hunt 1 , Jose Manuel Gonzalez 1 , James C Wright 4 , Mike Kay 1 , Claire Davidson 1 , Stephen Fitzgerald 5 , Ruth Seal 1, 6 , Susan Tweedie 1 , Liang He 2, 3 , Robert M Waterhouse 7, 8 , Yue Li 2, 3 , Elspeth Bruford 1, 6 , Jyoti S Choudhary 4 , Adam Frankish 1 , Manolis Kellis 2, 3
The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyze more than 1000 high-scoring human PhyloCSF regions and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic data sets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein altering. Altogether, our PhyloCSF data sets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterization.

Updated: 2019-11-01