ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses
bioRxiv - Bioinformatics Pub Date : 2020-11-16 , DOI: 10.1101/2020.10.13.336479
Natasha Pavlovikj , Joao Carlos Gomes-Neto , Jitender S. Deogun , Andrew K. Benson

Whole Genome Sequence (WGS) data from bacterial species are used for a variety of applications ranging from basic microbiological research to diagnostics and epidemiological surveillance. The availability of WGS data from hundreds of thousands of individual isolates of individual microbial species presents a tremendous opportunity for discovery and hypothesis-generating research into the ecology and evolution of these microorganisms. The limited scalability and user-friendliness of existing pipelines for population-scale inquiry, however, restrict the application of systematic, population-scale approaches. Here, we present ProkEvo, an automated, scalable, and open-source framework for bacterial population genomics analyses using WGS data. ProkEvo was specifically developed to achieve the following goals: 1) Automation and scaling of complex combinations of computational analyses for many thousands of bacterial genomes from inputs of raw Illumina paired-end sequence reads; 2) Use of workflow management systems (WMS) such as Pegasus WMS to ensure reproducibility, scalability, modularity, fault-tolerance, and robust file management throughout the process; 3) Use of high-performance and high-throughput computational platforms; 4) Generation of hierarchical population-based genotypes at different scales of resolution based on combinations of multi-locus and Bayesian statistical approaches for classification; 5) Detection of antimicrobial resistance (AMR) genes, putative virulence factors, and plasmids from curated databases and association with genotypic classifications; and 6) Production of pan-genome annotations and data compilation that can be utilized for downstream analysis. The scalability of ProkEvo was measured with two datasets comprising significantly different numbers of input genomes (one with ~2,400 genomes and the other with ~23,000 genomes). Depending on the dataset and the computational platform used, the running time of ProkEvo ranged from ~3 to ~26 days.
ProkEvo can be used with virtually any bacterial species, and the Pegasus WMS facilitates the addition or removal of programs from the workflow and the modification of options within them. All the dependencies of ProkEvo can be distributed via a conda environment or a Docker image. To demonstrate the versatility of the ProkEvo platform, we performed population-based analyses of available genomes from three distinct pathogenic bacterial species as individual case studies (three serovars of Salmonella enterica, as well as Campylobacter jejuni and Staphylococcus aureus). The case studies used reproducible Python and R scripts documented in Jupyter Notebooks, and together they illustrate how hierarchical analyses of population structures, genotype frequencies, and the distribution of specific gene functions can be used to generate novel hypotheses about the evolutionary history and ecological characteristics of specific populations of each pathogen. Overall, our study shows that ProkEvo presents a viable option for scalable, automated analyses of bacterial populations, with powerful applications for basic microbiology research, clinical microbiological diagnostics, and epidemiological surveillance.
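
As a rough illustration of the kind of downstream tabulation the case studies describe (genotype frequencies at nested levels, and the distribution of gene functions across genotypes), the following is a minimal sketch only. It is not taken from the ProkEvo notebooks: the record layout, the field names (`serovar`, `st`, `amr_genes`), the helper functions, and all values are hypothetical and chosen purely for demonstration.

```python
from collections import Counter, defaultdict

# Hypothetical per-isolate records; every value here is made up for illustration
# and does not come from the ProkEvo study.
isolates = [
    {"id": "iso1", "serovar": "A", "st": "ST1", "amr_genes": {"geneX"}},
    {"id": "iso2", "serovar": "A", "st": "ST1", "amr_genes": set()},
    {"id": "iso3", "serovar": "A", "st": "ST2", "amr_genes": {"geneX", "geneY"}},
    {"id": "iso4", "serovar": "B", "st": "ST3", "amr_genes": {"geneY"}},
]


def genotype_frequencies(records, levels):
    """Return the relative frequency of each genotype combination at the given levels."""
    counts = Counter(tuple(r[level] for level in levels) for r in records)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def gene_prevalence(records, level):
    """Return, per genotype at `level`, the fraction of isolates carrying each gene.

    Genotypes in which no isolate carries any gene are absent from the result.
    """
    group_sizes = Counter(r[level] for r in records)
    carriers = defaultdict(Counter)
    for r in records:
        for gene in r["amr_genes"]:
            carriers[r[level]][gene] += 1
    return {
        group: {gene: n / group_sizes[group] for gene, n in genes.items()}
        for group, genes in carriers.items()
    }


# Coarse level (serovar), then a finer level nested within it (serovar + ST).
serovar_freq = genotype_frequencies(isolates, ["serovar"])
st_freq = genotype_frequencies(isolates, ["serovar", "st"])
# Share of isolates within each serovar that carry each gene.
prevalence = gene_prevalence(isolates, "serovar")
```

Passing a longer list of levels to `genotype_frequencies` gives progressively finer resolution, which loosely mirrors the hierarchical genotyping idea; a real analysis would read these fields from the pipeline's output tables rather than from an in-memory list.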

Updated: 2020-11-17