An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics
Genome Research (IF 7), Pub Date: 2017-12-01, DOI: 10.1101/gr.218255.116
Ulrich Omasits, Adithi R. Varadarajan, Michael Schmid, Sandra Goetze, Damianos Melidis, Marc Bourqui, Olga Nikolayeva, Maxime Québatte, Andrea Patrignani, Christoph Dehio, Juerg E. Frey, Mark D. Robinson, Bernd Wollscheid, Christian H. Ahrens

Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy toward accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs (a modified six-frame translation considering alternative start codons) in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics data set against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and single amino acid variations (SAAVs) identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin. We release iPtgxDBs for B. henselae, Bradyrhizobium diazoefficiens and Escherichia coli and the software to generate both proteogenomics search databases and integrated annotation files that can be viewed in a genome browser for any prokaryote.
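
As a rough illustration of the "in silico ORFs" component mentioned in the abstract, the Python sketch below scans all six reading frames and accepts alternative start codons. The start-codon set (ATG, GTG, TTG), the minimum-length cutoff, the choice to report only the longest ORF per stop codon, and all function names are assumptions made for this example. It is not the authors' released software, whose modified six-frame translation may differ in detail.

```python
# Illustrative sketch only: the start-codon set, the length cutoff and all
# names here are assumptions, not taken from the published software.
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed set of alternative start codons

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(genome, min_codons=3):
    """Return a list of (strand, frame, start, end, dna) tuples.

    For each stop codon, only the longest ORF (the most upstream in-frame
    start codon) is reported. Coordinates are 0-based and end-exclusive,
    and refer to the strand that was scanned, so "-" strand coordinates
    are positions on the reverse complement.
    """
    genome = genome.upper()
    orfs = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            start = None  # position of the first start codon since the last stop
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon in STARTS:
                    start = i
                elif codon in STOPS:
                    # Count the codons before the stop codon
                    if start is not None and (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, seq[start:i + 3]))
                    start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAACCCGGGTAGCC"):
        print(orf)
```

An ORF beginning with GTG or TTG is found the same way as one beginning with ATG. A fuller version would also translate each ORF with a bacterial codon table and merge the results with the reference annotations and ab initio predictions that the abstract lists.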




Updated: 2017-12-01