xGAP: a python based efficient, modular, extensible and fault tolerant genomic analysis pipeline for variant discovery


Bioinformatics (IF 4.4), Pub Date: 2021-01-05, DOI: 10.1093/bioinformatics/btaa1097
Aditya Gorla, Brandon Jew, Luke Zhang, Jae Hoon Sul


Motivation: Since the first human genome was sequenced in 2001, there has been rapid growth in the number of bioinformatic methods to process and analyze next-generation sequencing (NGS) data for research and clinical studies that aim to identify genetic variants influencing diseases and traits. To achieve this goal, one first needs to call genetic variants from NGS data, which requires multiple computationally intensive analysis steps. Unfortunately, there is a lack of open-source pipelines that can perform all these steps on NGS data in a manner that is fully automated, efficient, rapid, scalable, modular, user-friendly and fault tolerant. To address this, we introduce xGAP, an extensible Genome Analysis Pipeline, which implements a modified GATK best practice to analyze DNA-seq data with the aforementioned functionalities.

Results: xGAP implements massive parallelization of the modified GATK best practice pipeline by splitting a genome into many smaller regions with efficient load-balancing to achieve high scalability. It can process 30× coverage whole-genome sequencing (WGS) data in ~90 min. In terms of accuracy of discovered variants, xGAP achieves average F1 scores of 99.37% for single nucleotide variants and 99.20% for insertions/deletions across seven benchmark WGS datasets. We achieve highly consistent results across multiple on-premises (SGE and SLURM) high-performance clusters. Compared to the Churchill pipeline, with similar parallelization, xGAP is 20% faster when analyzing 50× coverage WGS on Amazon Web Services. Finally, xGAP is user-friendly and fault tolerant: it can automatically re-initiate failed processes to minimize required user intervention.

Availability and implementation: xGAP is available at https://github.com/Adigorla/xgap.

Supplementary information: Supplementary data are available at Bioinformatics online.
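The region-splitting, load-balancing and automatic-restart ideas described in the Results can be illustrated with a small sketch. This is not xGAP's code: the function names (`split_into_regions`, `balance`, `run_with_retry`), the fixed window size, the use of region length as a proxy for workload, and the greedy longest-first assignment are all assumptions made for illustration only.

```python
import heapq


def split_into_regions(chrom_lengths, region_size):
    """Cut each chromosome into half-open (chrom, start, end) windows."""
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, region_size):
            regions.append((chrom, start, min(start + region_size, length)))
    return regions


def balance(regions, n_workers):
    """Greedy longest-first assignment (hypothetical scheme).

    Each region goes to the worker with the smallest total load so far,
    where load is approximated by the number of bases assigned.
    """
    heap = [(0, w) for w in range(n_workers)]  # (load in bases, worker index)
    heapq.heapify(heap)
    bins = [[] for _ in range(n_workers)]
    for region in sorted(regions, key=lambda r: r[2] - r[1], reverse=True):
        load, worker = heapq.heappop(heap)
        bins[worker].append(region)
        heapq.heappush(heap, (load + region[2] - region[1], worker))
    return bins


def run_with_retry(task, region, max_attempts=3):
    """Re-run a failed per-region task, re-raising after max_attempts failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(region)
        except Exception:
            if attempt == max_attempts:
                raise


if __name__ == "__main__":
    # Toy chromosome lengths, not real genome coordinates.
    regions = split_into_regions({"chr1": 250, "chr2": 100}, region_size=100)
    for worker, assigned in enumerate(balance(regions, n_workers=2)):
        print(worker, assigned)
```

In a real cluster setting, read depth or estimated runtime per region would likely be a better load measure than region length, and each worker's regions would be submitted as scheduler jobs rather than run in-process.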


Updated: 2021-01-05
