Evaluation of methods for detecting human reads in microbial sequencing datasets.
Microbial Genomics ( IF 3.9 ) Pub Date : 2020-07-01 , DOI: 10.1099/mgen.0.000393
Stephen J Bush 1 , Thomas R Connor 2, 3 , Tim E A Peto 1, 4, 5 , Derrick W Crook 1, 4, 5 , A Sarah Walker 1, 4, 5

Sequencing data from host-associated microbes can often be contaminated by the body of the investigator or research subject. Human DNA is typically removed from microbial reads either by subtractive alignment (dropping all reads that map to the human genome) or by using a read classification tool to predict those of human origin, and then discarding them. To inform best practice guidelines, we benchmarked eight alignment-based and two classification-based methods of human read detection using simulated data from 10 clinically prevalent bacteria and three viruses, into which contaminating human reads had been added. While the majority of methods successfully detected >99 % of the human reads, they were distinguishable by variance. The most precise methods, with negligible variance, were Bowtie2 and SNAP, both of which misidentified few, if any, bacterial reads (and no viral reads) as human. While correctly detecting a similar number of human reads, methods based on taxonomic classification, such as Kraken2 and Centrifuge, could misclassify bacterial reads as human, although the extent of this was species-specific. Among the most sensitive methods of human read detection was BWA, although this also made the greatest number of false positive classifications. Across all methods, the set of human reads not identified as such, although often representing <0.1 % of the total reads, were non-randomly distributed along the human genome with many originating from the repeat-rich sex chromosomes. For viral reads and longer (>300 bp) bacterial reads, the highest performing approaches were classification-based, using Kraken2 or Centrifuge. For shorter (c. 150 bp) bacterial reads, combining multiple methods of human read detection maximized the recovery of human reads from contaminated short read datasets without being compromised by false positives. 
A particularly high-performance approach with shorter bacterial reads was a two-stage classification using Bowtie2 followed by SNAP. Using this approach, we re-examined 11 577 publicly archived bacterial read sets for hitherto undetected human contamination. We were able to extract a sufficient number of reads to call known human SNPs, including those with clinical significance, in 6 % of the samples. These results show that phenotypically distinct human sequence is detectable in publicly archived microbial read datasets.
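
The two-stage idea and the sensitivity/precision comparison described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that the second stage is applied only to reads the first stage left unflagged, and that a read counts as human if either stage flags it. The names `two_stage_human_detection`, `sensitivity_precision`, `toy_stage_one`, `toy_stage_two`, and `demo_reads` are hypothetical; the toy detectors are simple motif matchers standing in for real tools such as Bowtie2 and SNAP, which are not invoked here.

```python
from typing import Callable, Dict, Set, Tuple

Reads = Dict[str, str]                   # read ID -> sequence
Detector = Callable[[Reads], Set[str]]   # returns the IDs it calls human


def two_stage_human_detection(reads: Reads,
                              stage_one: Detector,
                              stage_two: Detector) -> Set[str]:
    """Flag reads as human if either stage calls them human.

    Assumption: stage_two only sees reads that stage_one did not flag.
    """
    first = set(stage_one(reads))
    remaining = {rid: seq for rid, seq in reads.items() if rid not in first}
    second = set(stage_two(remaining))
    return first | second


def sensitivity_precision(predicted_human: Set[str],
                          true_human: Set[str]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); precision = TP / (TP + FP)."""
    tp = len(predicted_human & true_human)
    fp = len(predicted_human - true_human)
    fn = len(true_human - predicted_human)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


# Hypothetical stand-ins for real aligners: flag reads containing a motif.
def toy_stage_one(reads: Reads) -> Set[str]:
    return {rid for rid, seq in reads.items() if "TTAGGG" in seq}


def toy_stage_two(reads: Reads) -> Set[str]:
    # Deliberately over-eager: the second motif also hits one bacterial read.
    return {rid for rid, seq in reads.items()
            if "CCCTAA" in seq or "GGCCGG" in seq}


# Simulated data: IDs starting with "h" are human, "b" are bacterial.
demo_reads: Reads = {
    "h1": "ACGTTTAGGG",
    "h2": "TTAGGGTTAGGG",
    "h3": "CCCTAA",
    "b1": "GATTACA",
    "b2": "GGCCGGCC",
}
demo_truth: Set[str] = {"h1", "h2", "h3"}

flagged = two_stage_human_detection(demo_reads, toy_stage_one, toy_stage_two)
sens, prec = sensitivity_precision(flagged, demo_truth)
print(sorted(flagged), sens, prec)
```

In this toy run the second stage recovers the human read that the first stage missed, at the cost of one false positive. That is the trade-off the benchmark measures when it compares methods by both the human reads they detect and the microbial reads they misidentify as human.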

Updated: 2020-08-20