Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism-calling pipelines.
GigaScience (IF 9.2), Pub Date: 2020-02-01, DOI: 10.1093/gigascience/giaa007
Stephen J Bush 1, 2 , Dona Foster 1, 3 , David W Eyre 1 , Emily L Clark 4 , Nicola De Maio 5 , Liam P Shaw 1 , Nicole Stoesser 1 , Tim E A Peto 1, 2, 3 , Derrick W Crook 1, 2, 3 , A Sarah Walker 1, 2, 3

BACKGROUND: Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella.

RESULTS: We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis.

CONCLUSIONS: The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.
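The two quantities the abstract relates, pipeline sensitivity/precision and Mash distance, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The function names, the representation of a SNP call as a `(position, alternate_base)` tuple, and the example calls are hypothetical. The F-score is a common summary statistic that the abstract does not mention. The distance formula is the one published for Mash, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index of the two k-mer sketches and k = 21 is Mash's default k-mer size.

```python
import math


def snp_metrics(truth, called):
    """Return (sensitivity, precision, f_score) for two sets of SNP calls.

    Each call is assumed to be a hashable tuple such as (position, alt_base);
    this representation is an illustrative assumption, not the paper's format.
    """
    true_pos = len(truth & called)    # SNPs both present in truth and called
    false_pos = len(called - truth)   # called SNPs absent from the truth set
    false_neg = len(truth - called)   # true SNPs the pipeline missed

    sensitivity = true_pos / (true_pos + false_neg) if truth else 0.0
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    denom = precision + sensitivity
    f_score = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, precision, f_score


def mash_distance(jaccard, k=21):
    """Convert a Jaccard index of k-mer sketches into a Mash distance."""
    if jaccard <= 0.0:
        return 1.0  # no shared k-mers: treated as maximally distant
    distance = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return min(1.0, max(0.0, distance))


# Hypothetical truth set and pipeline output for one isolate.
truth_calls = {(100, "A"), (200, "C"), (300, "G"), (400, "T")}
pipeline_calls = {(100, "A"), (200, "C"), (500, "G")}

sens, prec, f1 = snp_metrics(truth_calls, pipeline_calls)
print(f"sensitivity={sens:.3f} precision={prec:.3f} F-score={f1:.3f}")
print(f"Mash distance at Jaccard 0.5: {mash_distance(0.5):.4f}")
```

In an evaluation of this kind, one such triple would be computed per pipeline and per read/reference pair, then compared against the Mash distance between the reads and the reference genome.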

Updated: 2020-02-07