RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation
Nucleic Acids Research (IF 16.6), Pub Date: 2020-12-03, DOI: 10.1093/nar/gkaa1105
Wenjun Li, Kathleen R O'Neill, Daniel H Haft, Michael DiCuccio, Vyacheslav Chetvernin, Azat Badretdin, George Coulouris, Farideh Chitsaz, Myra K Derbyshire, A Scott Durkin, Noreen R Gonzales, Marc Gwadz, Christopher J Lanczycki, James S Song, Narmada Thanki, Jiyao Wang, Roxanne A Yamashita, Mingzhang Yang, Chanjuan Zheng, Aron Marchler-Bauer, Françoise Thibaud-Nissen

Abstract
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.


Updated: 2021-01-03