Human contamination in bacterial genomes has created thousands of spurious proteins.
Genome Research (IF 7), Pub Date: 2019-05-07, DOI: 10.1101/gr.245373.118
Florian P Breitwieser 1, Mihaela Pertea 1,2, Aleksey V Zimin 1,3, Steven L Salzberg 1,2,3,4

Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects. Our large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database reveals that 2250 genomes are contaminated by human sequence. The contaminant sequences derive primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome, GRCh38. The absence of the sequences from the human assembly offers a likely explanation for their presence in bacterial assemblies. In some cases, the contaminating contigs have been erroneously annotated as containing protein-coding sequences, which over time have propagated to create spurious protein "families" across multiple prokaryotic and eukaryotic genomes. As a result, 3437 spurious protein entries are currently present in the widely used nr and TrEMBL protein databases. We report here an extensive list of contaminant sequences in bacterial genome assemblies and the proteins associated with them. We found that nearly all contaminants occurred in small contigs in draft genomes, which suggests that filtering out small contigs from draft genome assemblies may mitigate the issue of contamination while still keeping nearly all of the genuine genomic sequences.
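The mitigation suggested in the last sentence, dropping small contigs from a draft assembly, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names `read_fasta` and `filter_small_contigs` are hypothetical, and the length cutoff is an assumed placeholder, since the abstract does not state a threshold.

```python
import io
from typing import Iterator, List, TextIO, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_small_contigs(
    handle: TextIO, min_length: int = 1000
) -> Tuple[List[Record], List[Record]]:
    """Split contigs into (kept, dropped) by sequence length.

    min_length is an assumed, illustrative cutoff; the abstract gives no value.
    """
    kept: List[Record] = []
    dropped: List[Record] = []
    for header, seq in read_fasta(handle):
        (kept if len(seq) >= min_length else dropped).append((header, seq))
    return kept, dropped


# Toy usage with a tiny cutoff so the example is easy to follow.
demo = io.StringIO(">contig_1\nACGT\nACGT\n>contig_2\nAC\n")
kept, dropped = filter_small_contigs(demo, min_length=5)
```

In practice the dropped contigs would be worth reviewing (for example by aligning them against human sequence) rather than silently discarding them, because a length cutoff alone cannot tell contamination from genuine short contigs.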

Updated: 2019-11-01