CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats
Database: The Journal of Biological Databases and Curation (IF 3.4), Pub Date: 2020-11-18, DOI: 10.1093/database/baaa088
Alejandro Rubio 1, Pablo Mier 2, Miguel A Andrade-Navarro 2, Andrés Garzón 1, Juan Jiménez 1, Antonio J Pérez-Pulido 1

The genomics era is generating a plethora of biological sequences that are usually stored in public databases. There are many computational tools that facilitate the annotation of these sequences, but sometimes they produce mistakes that enter the databases and can be propagated when erroneous data are used for secondary analyses, such as gene prediction or homology searching. While developing a computational gene finder based on protein-coding sequences, we discovered that the reference UniProtKB protein database is contaminated with some spurious sequences translated from DNA containing clustered regularly interspaced short palindromic repeats (CRISPR). We therefore encourage developers of prokaryotic computational gene finders and protein database curators to consider this source of error.
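
To make the idea of "spurious proteins containing spaced repeats" concrete, here is a minimal Python sketch of one way such sequences might be flagged. It is not the authors' method: the function name `find_spaced_repeats`, the parameters `k`, `min_copies`, and `tolerance`, and their default values are all hypothetical choices made for illustration. The assumption is that a translated CRISPR array tends to show the same short peptide recurring at a nearly constant distance along the protein.

```python
from collections import defaultdict


def find_spaced_repeats(seq, k=5, min_copies=3, tolerance=2):
    """Return {kmer: positions} for k-mers recurring at near-regular spacing.

    Illustrative heuristic only; thresholds are arbitrary assumptions.
    """
    # Record every start position of every k-mer in the protein sequence.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    hits = {}
    for kmer, pos in positions.items():
        if len(pos) < min_copies:
            continue
        # Distances between consecutive copies of the same k-mer.
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        # Copies must not overlap, and the gaps must be nearly constant.
        if min(gaps) >= k and max(gaps) - min(gaps) <= tolerance:
            hits[kmer] = pos
    return hits


# Toy sequence (made up): the peptide "GSRLK" separated by distinct 7-residue spacers.
toy = "MT" + "GSRLK" + "ADEFHIY" + "GSRLK" + "CNPQTVW" + "GSRLK" + "EDHAFYI" + "GSRLK" + "PQ"
suspects = find_spaced_repeats(toy)
```

A heuristic like this would also flag genuine repeat-containing proteins, so any hit would be a candidate for manual review rather than proof of an erroneous translation.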

Updated: 2020-11-19