Benchmarking the performance of Pool‐seq SNP callers using simulated and real sequencing data
Molecular Ecology Resources (IF 5.5), Pub Date: 2021-02-03, DOI: 10.1111/1755-0998.13343
Sara Guirao-Rico, Josefa González

Population genomics is a fast‐developing discipline with promising applications in a growing number of life sciences fields. Advances in sequencing technologies and bioinformatics tools allow population genomics to exploit genome‐wide information to identify the molecular variants underlying traits of interest and the evolutionary forces that modulate these variants through space and time. However, sequencing individual genomes is still too costly for genomic analyses of multiple populations. Pooling individuals for sequencing (Pool‐seq) can be a more effective strategy for single nucleotide polymorphism (SNP) detection and allele frequency estimation because it yields a higher total coverage. Compared to individual sequencing, however, SNP calling from pools has the additional difficulty of distinguishing rare variants from sequencing errors, a problem that is often avoided by setting a minimum allele frequency threshold for the analysis. Finding an optimal balance between minimizing information loss and reducing sequencing costs is essential to ensure the success of population genomics studies. Here, we benchmarked the performance of SNP callers for Pool‐seq data that are based on different approaches, under different conditions, using both computer simulations and real data. We found that the performance of SNP callers varied for allele frequencies up to 0.35. We also found that SNP callers based on Bayesian (SNAPE‐pooled) or maximum likelihood (MAPGD) approaches outperform the two heuristic callers tested (VarScan and PoolSNP) in terms of the balance between sensitivity and false discovery rate (FDR), in both simulated and real sequencing data. Our results will help inform the selection of the most appropriate SNP caller, not only for large‐scale population studies but also in cases where the Pool‐seq strategy is the only option, such as metagenomic or polyploid studies.

Updated: 2021-04-12